Educational Measurement



Digitized by the Internet Archive 

in 2011 with funding from 

LYRASIS members and Sloan Foundation 



Sponsored by the American Council on Education 

Dorothy C. Adkins 

University of North Carolina 

Walter W. Cook 

University of Minnesota 

Edward E. Cureton 
University of Tennessee 

Frederick B. Davis 

Hunter College 

John C. Flanagan 

University of Pittsburgh 

Irving Lorge 

Teachers College 
Columbia University 


T. R. McConnell
University of Minnesota

Charles I. Mosier 

Office of the Adjutant General
Department of the Army

Phillip J. Rulon 

Harvard University 

Donald J. Shank 

Institute of International Education

K. W. Vaughn

Formerly with Cooperative

Test Service 

Ben D. Wood 

Columbia University 


















The preparation of this manuscript and its publication 

were made possible by the grant of funds by 

The Grant Foundation, Inc., New York 


Printed in the United States of America 
by George Banta Publishing Company, Menasha, Wisconsin


It is well known that the measurement of individual ability, 
achievements, and characteristics offers the most solid basis on which stu-
dents may be assisted in their choice of studies and occupations. Although 
individual measurement was once regarded with natural suspicion, re- 
search in this field has made such rapid progress as now to command the 
respect and confidence of personnel officers both in schools and colleges, on 
the one hand, and in industry, on the other. The movement may, indeed, 
now be regarded as having established itself as the chief source of in- 
formation on which educational and personnel officers may rely to aid them 
in their processes of selection and guidance of individuals. 

In the belief that the field of individual measurement offered great pos- 
sibilities for improving educational processes, the American Council on 
Education has long given continuous and close attention to various aspects 
of the movement. For many years it sponsored the American Council 
Psychological Examination. The Council founded the Cooperative Test 
Service and the National Teacher Examinations, as well as numerous other 
enterprises in the testing field. 

The Council has now turned over the operation of these activities to the 
Educational Testing Service, which in 1948 it helped to establish jointly 
with the College Entrance Examination Board and the Carnegie Founda- 
tion for the Advancement of Teaching. 

The Council retains full liberty, however, to carry on studies concerning 
the areas in which there is special need for new instruments of evaluation. 
Indeed, continuous appraisal of the effectiveness and usefulness of testing 
instruments should remain primarily the responsibility of educational organizations.

Certainly, developments in the testing field are both so extensive and 
complex as to justify and make necessary comprehensive periodical reviews 
in order that we may know where we are. For this reason it is particularly 
appropriate that the Council's Committee on Measurement and Guidance 
should have planned the general character and manner of executing the 
publication of the present book on educational measurement. 

It is a pleasure as well as a privilege to make acknowledgment to The 
Grant Foundation for the subsidy which made this project possible, and to 
Mr. W. T. Grant personally for his understanding of and interest in the 
study of the individual. Mr. Grant's interest in extending the scope of per- 
sonality analysis has been a gratifying encouragement to the American 


Council on Education and to everyone who has collaborated in the produc-
tion of this volume. 

It should be pointed out that the subsidy was used exclusively to bear 
the clerical and incidental costs in the preparation of the manuscript (in- 
cluding the expenses of the planning and editorial conferences) and the 
costs of publication. The work of the writers, collaborators, and the editor 
was contributed without compensation as a professional service. No royal- 
ties to individuals are to be paid out of the proceeds from the sale of the 
volume. All income from sales is to remain in a permanent Measurement 
Book Fund of the American Council on Education, to be spent on future 
revisions of the volume and on the preparation and publication of supple- 
mentary brochures on special topics. 

The Council is deeply indebted to each and every person who has taken 
the time to contribute to the planning and completion of this volume, but 
especially to Dr. E. F. Lindquist for the sound scholarship, patient de- 
votion, and broad point of view which he has brought to his duties as 
editor. I believe that the volume will immediately establish itself as a land- 
mark in the development of the testing movement. 

George F. Zook, President 
American Council on Education 
July 1950 


There has long been an urgent need, particularly in the 
advanced training of measurement workers, for a comprehensive hand- 
book and textbook on the theory and technique of educational measure- 
ment. At the time that this volume was planned, in 1945, no book had 
yet been published that would even begin to fill this need. Anyone then 
attempting to offer a course in the theory and technique of educational 
measurement at the advanced graduate level found it almost impossible to 
provide the student with adequate reference and instructional materials. 
Much that is of real value had been written and published in this field, but 
most of it was virtually inaccessible to students in general and particularly 
to field workers. Most of it had appeared in articles which were very 
widely scattered in a large number of different periodicals over a period 
of many years, while the rest had appeared in special bulletins of various 
testing agencies, in committee reports and in reports of conference pro- 
ceedings, in test manuals, and in other sources that had never obtained 
general distribution and were often unavailable even in the best of uni- 
versity libraries. 

Not only was most of what had been written not readily accessible to 
students and field workers, but much of what needed to be made avail- 
able to them had never been written up and published at all. Most of the 
published articles were concerned with specific contributions to measure-
ment theory, particularly the mathematical theory, or with some newly 
developed technique, or with a specific research study. There were few, 
if any, articles concerned with what might be described as the art of test 
construction or of item writing, essays on the underlying philosophy of 
educational measurement, articles in which the writer attempted to sum- 
marize, evaluate, or elucidate the contributions of others, or articles in 
which he tried to pass on to others the detailed knowledge, the "tricks of 
the trade," and the improved sense of values that he had acquired through 
practical experience. Judging by past production, this was a type of writ- 
ing which would be done only in the preparation of a textbook, or on 
definite assignment as a professional obligation. 

Primarily, perhaps, as a result of this lack of reference and instructional 
materials, there were in 1945 very few educational institutions in which 
any systematic courses in educational measurement were being offered at 
an advanced graduate level; and in the few university courses of this char- 
acter which were being offered, the instruction undoubtedly fell far short 



of its possible effectiveness, for the reasons given. It appeared, therefore,
that until a comprehensive and teachable book in this field was made avail- 
able, not only would the improvement of educational measurement practice 
in general be retarded because of inadequate training facilities, but there 
was danger also that much of what already had been learned would be lost 
through failure to record it, and would have to be rediscovered and re- 
learned through experience by individual workers. 

One educational agency that long had been aware of this situation and 
anxious to do something about it was the American Council on Educa- 
tion. The Council's standing Committee on Measurement and Guidance 
had earlier (1936) sponsored a more elementary book of this character, 
The Construction and Use of Achievement Examinations, edited by
Hawkes, Lindquist, and Mann, and had since periodically reviewed the 
need for a similar book at a much more advanced and more technical level. 
The suggestion was made that the best way for the committee to get this 
job done was to induce some one individual singlehandedly to prepare the 
kind of volume needed, and to provide him with the best possible facili- 
ties for doing the job. It was contended that only in this manner could a 
really well-integrated and effective treatment be prepared. The most telling 
reply to this suggestion was, of course, that in over twenty-five years of the 
objective testing movement, no one had yet tackled this job or announced 
his intention of doing so. Because of the ever-increasing degree of spe- 
cialization within the whole field of measurement, the chances seemed 
more remote than ever that the volume needed would be produced in 
this way. Aside from any other possible disqualifications, no one individual 
had had sufficiently varied practical experience to write authoritatively in 
all of the various areas of specialization, even if he had the courage and 
time to tackle so large an assignment. It seemed clear to the committee, 
therefore, that the kind of book needed could be prepared only through 
the collaboration of a large number of specialists, each writing in the area 
of his own special competence. 

Accordingly, as soon as the end of World War II permitted, the Meas- 
urement and Guidance Committee* authorized the calling of a special con-
ference to plan and initiate a project of the general character just sug- 
gested. This conference was held in Williamsburg, Virginia, April 17-19, 
1945. The members of the planning conference were W. W. Cook, John 
Flanagan, Irving Lorge, T. R. McConnell, Phillip Rulon, Donald J. 
Shank, K. W. Vaughn, Ben D. Wood, and E. F. Lindquist, chairman. 

* The membership of the committee at this time consisted of: Sarah G. Blanding, Galen 
Jones, E. F. Lindquist, Malcolm Price, E. G. Williamson, Donald J. Shank, Francis L. Bacon, 
C. L. Cushman, Eugene R. Smith, and George F. Zook. 


At this conference the purpose and scope of the projected volume were 
determined, a tentative table of contents was prepared, writers and col- 
laborators for the various chapters were nominated, and a general edi- 
torial policy was set. 

In selecting the contributors to the volume, the planning group felt that 
no one individual alone should be entrusted with the preparation of any 
single chapter, but that different points of view and types of experience 
should be represented in every case. It was hoped that each chapter might 
represent the joint product of the half-dozen or so persons generally recognized as the most competent and experienced thinkers in the area involved,
that one of these would take primary responsibility for writing the chap- 
ter, and that the others would serve as collaborators, whose duty it would 
be to review, criticize, and revise what he had written so as to make it 
generally acceptable. A total of approximately seventy measurement workers 
were nominated for participation — twenty as chapter writers and the rest 
as collaborators. 

The plans developed at the Williamsburg conference were quickly ap- 
proved by the Measurement and Guidance Committee and by the Ameri- 
can Council on Education, the editor of the projected volume was au- 
thorized to proceed at once with execution of these plans, and a generous 
grant of $20,000 was later obtained by the American Council on Educa- 
tion from The Grant Foundation to pay the expenses of the project. 

The project was initially very well received — practically everyone who 
was invited to participate agreed to do so — but the unusual circumstances 
existing in the universities following the war prevented many of the con- 
tributors from completing their assignments until long past the scheduled
time. The original plans had called for completion of the entire project 
in less than two years, but almost four years passed before first drafts of 
all of the chapters were in the hands of the editor.

A large share of the editorial work involved was done in a special edi- 
torial "work conference" held at Gloucester, Massachusetts, from August 
14-27, 1949. The members of this conference were Dorothy Adkins, 
W. W. Cook, E. E. Cureton, Frederick B. Davis, John Flanagan, Irving 
Lorge, Charles Mosier, and E. F. Lindquist. Very special acknowledg- 
ments are due the members of this conference. For two full weeks, work- 
ing in editorial teams of two or three per chapter, they collectively read 
carefully, criticized, and revised more than 1,100 pages of manuscript, and 
prepared specific suggestions to the writers for further revisions. In several 
instances, their efforts resulted in major improvements in the original drafts, 
and many minor inconsistencies, errors, and infelicities were eliminated. 
Following the Gloucester meeting, furthermore, several members of the 


conference contributed a considerable amount of their time in further criti- 
cism and revision of the manuscript. 

Except in the case of two chapters, the members of the editorial con- 
ference and the chapter writers were able to reach essentially full agree- 
ment on all important issues considered. In the published versions of these 
two chapters, opinions are expressed to which the editor and certain 
members of the editorial conference had objected rather strenuously, but 
to no avail. This statement is made here, not in criticism of the writers 
involved, but to explain certain inconsistencies among chapters, and to 
warn the reader of this volume that he should not assume that authorities 
are fully agreed on all ideas expressed therein. 

The original plans provided for a book of four major parts. The first 
three of these constitute the present volume. The fourth part was to have 
contained chapters on Local Testing Programs and Regional Testing Pro- 
grams, by Max D. Engelhart and E. F. Lindquist, respectively. These 
chapters were completed early, but had to be omitted from the published 
volume to keep its size within practicable limits. It is hoped that these 
chapters may be published later as supplementary brochures. 

Major acknowledgment is due to Dr. Ben D. Wood. As director of the 
Cooperative Test Service from its beginning until 1940, Dr. Wood was an 
ex officio member of the Measurement and Guidance Committee of the 
American Council on Education, and it was primarily at his suggestion and 
insistence that the committee sponsored both the present volume and the 
earlier volume by Hawkes, Lindquist, and Mann. It was largely through 
the efforts and intermediation of Dr. Wood, furthermore, that funds were 
obtained from The Grant Foundation to make this project possible. Cer- 
tainly it was primarily because of Dr. Wood's personal influence and en- 
couragement that the editor had the temerity to tackle this project and the 
determination to see it through to completion. This volume might well be 
regarded as a monument to Dr. Wood's outstanding leadership in educa- 
tional measurement in this country during the past three decades. 

It is obviously impracticable to attempt here to acknowledge individually 
and adequately all of the many other persons who helped make this volume 
possible. The names of authors and collaborators are listed at the be- 
ginning of the chapters. Everyone concerned — collaborators, conference and 
committee members, and others — cooperated fully and wholeheartedly in 
a fine professional spirit, and the editor wishes to express here his sincere 
personal appreciation of their very generous cooperation. 

As stated in Dr. Zook's Foreword, the American Council on Education has 
established a permanent Measurement Book Project Fund, into which will 
be paid the proceeds from the sale of this volume. This fund will be used 


for future revisions of this volume, and for a series of brochures dealing
with special measurement problems which are not of sufficiently general 
interest to be considered in the present volume. Readers and users of this 
volume are earnestly requested to cooperate in this continuing project by 
submitting criticisms and suggestions for revisions of this volume as well 
as suggested topics for special brochures. It is hoped that with the coopera- 
tion of the members of the profession, this project may render increasingly 
valuable service in the training of measurement workers and in the im- 
provement of educational measurement practices in general. 


Iowa City, Iowa 
September 1950 


Foreword by George F. Zook v 

Editor's Preface vii 

List of Figures xv 

List of Tables xix

Part One 

1. The Functions of Measurement in the Facilitation of Learn- 

ing 3 

By Walter W. Cook 

2. The Functions of Measurement in Improving Instruction ... 47 
By Ralph W. Tyler 

3. The Functions of Measurement in Counseling 68 

By John G. Darley and Gordon V. Anderson

4. The Functions of Measurement in Educational Placement . . 85 
By Henry Chauncey and Norman Frederiksen 

Part Two 

5. Preliminary Considerations in Objective Test Construction . 119 
By E. F. Lindquist 

6. Planning the Objective Test 159 

By K. W. Vaughn 

7. Writing the Test Item 185 

By Robert L. Ebel 

8. The Experimental Tryout of Test Materials 250 

By Herbert S. Conrad 

9. Item Selection Techniques 266 

By Frederick B. Davis 

10. Administering and Scoring the Objective Test 329 

By Arthur E. Traxler 



11. Reproducing the Test 417 

By Geraldine Spaulding 

12. Performance Tests of Educational Achievement 455 

By David G. Ryans and Norman Frederiksen

13. The Essay Type of Examination 495 

By John M. Stalnaker

Part Three 

14. The Fundamental Nature of Measurement 533 

By Irving Lorge 

15. Reliability 560 

By Robert L. Thorndike 

16. Validity 621 

By Edward E. Cureton 

17. Units, Scores, and Norms 695 

By John C. Flanagan 

18. Batteries and Profiles 764 

By Charles I. Mosier 

Index 811 


1. Variability in Chronological, Mental, and Achievement Ages of Pupils

in Two Eight-Grade Elementary Schools 12 

Continued 13 

2. Picture Items May Measure Ability To Use Instruments, Read Maps, 

etc 202 

3. Picture Test Item Containing Groups of Objects To Show Relationship 203 

4. Picture Test Item Showing Right and Wrong Procedures 203 

5. Item Data Card (front) 321

(Reverse side) 322

6. Example of Test Report Sheet To Be Signed by Room Examiner .... 341 

7. Example of Schedule and Time Record To Be Signed by Room Ex- 
aminer Administering National Teacher Examinations 342 

8. Directions for Administering the Cooperative Tests 355 

9. Example of Fan or Accordion Key Used in Hand Scoring 373 

10. Example of Plain Punched Key for Manual Scoring 375 

11. Portion of Answer Sheet for Cooperative English Test A, Form S . . . . 378 

12. Answer Sheets Are Placed in Scoring Frame 380 

13. Scoring with Stencil Key 381 

14. Answer Sheet of Toops Scoring Pad, Form 20 387 

(Reverse side) 388

15. Record Form for Item Counting 395 

16. Sample Aggregate- Weighting Sheet from Manual of Instruction for 

the IBM Test Scoring Machine 397 

17. Portion of Test and Key Used in the Selection of Scorers of Semi- 
objective Tests at the Educational Records Bureau 403 

18. Specimen Classification Slip for Recording Routine of Handling the 
Scoring of a Test 406 

19. Sample of Summary Chart Collating Results from All Schools Par- 
ticipating in Iowa Tests of Educational Development 409 

20. Facsimile of Original Test Item, Illustrating Photo-Offset Reproduc- 
tion of a Photograph and of Typewritten Material Facing page 419 



21. Reading Passage Printed with Lines Too Long for Easy Legibility . . 424 

22. Use of Very Short Lines Breaks Up the Text Excessively 425 

23. Space Wasted on Lines Containing Choices 427 

24. Double-Column Page Permits More Compact Arrangement of Same 

Material 427 

25. Item Is Difficult To Read in Strung-out Form 430 

26. Types of Items for Which Horizontal Arrangement Is Suitable .... 430 

27. Illustration of Various Ways of Arranging Multiple-Choice Items ... 431 

28. Diagram Interrupts Verbal Sequence from Stem to Choices 432 

29. Stem and Choices Kept Together 432 

30. Compact Arrangement, Useful When Economy of Space Is Especially 
Important 433 

31. Item Elements Differentiated by Use of Varying Amounts of Space 
between Lines 434 

32. Illustrations of Ways of Setting off Choice Numbers 435 

33. Answers Arranged in Order of Magnitude, but Likely To Cause Con- 
fusion between the Answer Proper and the Identification Number . . 436 

34. A Better Arrangement, Avoiding Confusion 436 

35. Illustrations of Ways of Arranging True-False Items 437 

36. Illustration of Page Arrangement with Reference Material (Cooperative 
Test of Social Studies Abilities) 438 

37. Illustration of Page Arrangement with Reference Material (Cooperative 
General Chemistry Test) 439 

38. Illustration of Page Arrangement with Reference Material (Cooperative 
Historical Geology Test) 441 

39. Illustration of Arrangement of Grouped Items 442 

40. Diagram Showing How To Determine Typing Area for Offset Copy 443 

41. Illustration of Use of Guide Lines To Insure Correct Placement of Patch 

on Typed Copy 447 

42. Illustration of Different Sizes of Type 449 

43. Illustration of Text Set Solid 449 

44. Illustration of Text Set with 2-Point Leading 450 

45. Wisconsin Miniature Test for Engine-Lathe Operation 459 

46. Miniature Punch Press Test 460 


47. Vigilance Test Used for Measuring Operations and Reactions of Auto- 
mobile Drivers 461 

48. Blum Sewing Machine Test 463 

49. A Rough Point-Scale for Judging Ability To Saw to a Line with Rip 

and Cross-cut Saw 474 

50. Two Gauges Employed for Rating Shopwork in Basic Engineering
Schools of the U.S. Navy 475 

51. Simple Device for Revealing "Wind" or Unevenness of a Flat Surface 476 

52. Dimension Meter for Testing Mechanical Ability 476 

53. Squareness Machine for Testing Mechanical Ability 476

54. Assembly of Bolt, Nut, and Washer, Illustrating Use of Code Numbers 

in Scoring Performance Tests 477 

55. Score Card Used in Rating the Cooking of Bacon 478 

56. Point-Scale Rating Form for "Fastening" in Woodworking 479 

57. Ayres Handwriting Scale 480 

58. Graded Sample Quality Scale for Judging the Excellence of Western 
Union Splices Made by Electricians 480 

59. Diagram Illustrating the Method of Coordinating the Administration 

of Identification, Performance, and Written Tests 488 

60. A Classification of Scales of Measurement (S. S. Stevens) 552 

61. Comparison of Traditional Grade Equivalent Norm Lines for Selected 
Subtests of the Metropolitan Series 708 

62. Distribution of Raw Scores on Arithmetic Test 710 

63. Distribution of Raw Scores on Reading Comprehension Test 710 

64. Illustrative Use of Arithmetic Probability Paper in Normalizing Dis- 
tribution of Scores in Table 12 729 

65. Normalized Curves for Two Groups 735 

66. Possible Forms of Raw-Score Distributions for Different Schools . . . 739 

67. Illustration of Method of Equating Scores by Equi-Percentile Curves . . 755 

68. An Illustration of the Procedure for Obtaining Equivalent Scores by 

the Equal Proportion Method 757 

69. Diagram Illustrating the Hypothetical Relationship between Driving 
Skill and Visual Acuity over the Entire Range 785 

70. Hypothetical Profile Showing Error Involved in Connecting Profile 
Points 796 


71. Two Types of Profiles Not Involving Connected Profile Points .... 796 

72. Hypothetical Profile of a Superior Student in the Eighth Grade 798 

73. Two Arrangements of Same Profile Showing Possible Configurational 
Errors 799 

74. Schematic Profile Omitting All Interpretational Data 801 

75. Hypothetical Profiles Representing Mean Scores for Tests A, B, and 
C, for Three Occupational Groups — File Clerks, Law School Students, 

and Engineers 802 

76. Sample Profile of Individuals Superimposed on Group Profiles 802 

77. Profile Illustrating Comparison of Individual with Successful and 
Unsuccessful Groups 803 


1. Mental Age Range by School Grade and Chronological Age 10 

2. Validity Coefficients of the C.E.E.B. Scholastic Aptitude Test for 
Harvard Students 90 

3. Correlations of Averaged C.E.E.B. Achievement Test Scores with 
Harvard Freshman Grades 92 

4. Correlations of C.E.E.B. Subject-Field Achievement Test Scores with 
Harvard and Yale Freshman Grades 92 

5. Number and Percent of Items in Each Category of the 1948 Premedical 
Science Achievement Test 162 

6. Values of Chi at Various Significance Levels for Certain Sample Sizes 290 

7. Item Analysis Data for a Test Item before and after Revision 307 

8. Possible Sources of Variance in Score on a Particular Test 568 

9. Equivalent Formulas for Estimating Reliability from Half-Length 
Tests 581 

10. Relationship between Reliability Coefficient and Standard Error of 
Measurement 610 

11. Distribution of Pintner IQ's for Modal Age Group, Metropolitan Na- 
tional Standardization 718 

12. Distribution of Scores on a Vocabulary Test for 515 Seventh-Grade 
Pupils 730 

13. Distribution of Scores and Equi-Percentile Points for Two Forms of 

a Science Test for Matched Samples of Tenth-Grade Pupils 754 

14. Reliability of Difference Scores in Terms of the Correlation between 

the Scores and Mean Reliability of the Scores 777 

15. Problems and Sample Techniques for the Linear Combination of 
Measures 790 

Part One 


I. The Functions of Measurement in the 
Facilitation of Learning 

By Walter W. Cook 

University of Minnesota 

Collaborators: William A. McCall, Teachers College, Columbia Uni- 
versity; Herschel T. Manuel, University of Texas ; Jacob S. Orleans, The 
City College of New York; Ralph W. Tyler, University of Chicago

Instruments of educational measurement are simply the means 
by which quantitative aspects of human behavior are observed with greater
accuracy. To the extent that such instruments conform to the principles of 
quantitative logic, it becomes possible to know with greater exactness the 
relationships among the various aspects of educational procedure, the apti- 
tudes of learners, and changes in human behavior. The purpose of this is 
to make possible more accurate prediction and control in the educational 
process. The value of measurement depends upon the extent to which the 
relationships established are crucial from the social point of view. The 
central questions are: What changes in behavior are desirable? How can
these changes be measured? What aptitudes are essential to the develop-
ment of a given form and level of behavior? What are the crucial elements
in the educative process? The value of educational measurement depends 
upon the validity of the answers to these questions. 

Although this book is primarily concerned with the more technical aspects 
of educational measurement and test construction, it is desirable to give 
early attention to functions, since instruments are designed and evaluated 
in terms of their functions. It should be recognized, however, that to state 
the functions of measurement in other than general terms is somewhat pre- 
sumptuous. The specific uses of an educational measuring device are limited 
largely by the ingenuity and insight of the designer and user. To state the 
uses in detail is beneficial to the student, but he should recognize that it is 
the educational sophistication of the writer that is being revealed, not the 
limitations of educational measurement. As in all science, advanced instru- 
ments suggest new uses, and new uses stimulate the creation of better- 
designed instruments. 

The central problem of all educational endeavor is learning. From the
determination of national and state educational policies to the selection of 



the type of broom the janitor will use in the local school, the final criterion 
is the facilitation of learning. In the educational sense, learning is the process 
of changing human behavior in socially desirable directions. In this broad 
interpretation, all the functions of educational measurement are concerned 
either directly or indirectly with the facilitation of learning. 

This chapter, however, will be limited largely to the functions of measure- 
ment in facilitating learning in the classroom situation. Some attention will 
be given to establishing classroom situations in which measurement can 
function more adequately. 

Since the first four chapters of this volume are concerned with the func- 
tions of measurement, it is desirable that a brief outline of the contents 
of these chapters be presented. 

Outline of the Functions of Measurement in Education

Over-all Educational Planning 

When social planning, as it relates to education, becomes more forthright 
and deliberate, the role of measurement assumes greater importance in the 
process. For example, during World War II when the maximum utilization 
of the nation's manpower was essential, questions were quickly asked which 
could be answered only through measurement. Some of these were: What 
are the behavior characteristics of successful combat pilots, bomber pilots, 
bombardiers, navigators, and the hundreds of other classifications of spe- 
cialists in the essential military and civilian services? What level of aptitude 
is necessary for the various types of special training? What is the distribu- 
tion of the various talents and combinations of talents in the general popu- 
lation? How can the individual with specialized talents be most quickly and 
reliably identified? How can manpower be most efficiently allocated to the 
various essential services? Experience gained in attempting to answer these 
questions under the stress of an emergency situation indicates that one of 
the most important elements in the preparation for a war emergency would 
be a continuing inventory of manpower during peace. 

If the efficient utilization of manpower is desirable during war, it will 
no doubt someday be accepted as equally important in peace. Measurement 
will answer many of the basic questions in the process. For example, what 
are the behavior characteristics of successful research workers in physics, 
chemistry, mathematics, and the other sciences? What are the behavior 
characteristics of successful surgeons, lawyers, engineers, etc.? In fact, the 
limit of this list today is the 30,000 job titles listed in the Dictionary of 
Occupational Titles. What levels and combinations of aptitudes are es- 


sential to the development of the desired types of behavior in each voca- 
tional area? What is the distribution of each talent and talent combination 
in the general population? How many specialists of the various types are 
needed? How many schools of each type will be necessary to train them? 

How far social and educational planning of this type may go in the 
future we cannot say, but certainly more or less uncoordinated efforts in 
this direction have been in evidence for some time. The report of the Presi- 
dent's Commission on Higher Education, which emphasizes the social role
of higher education in our democracy; the attempt of the Office of Scientific 
Personnel of the National Research Council to determine the extent of 
potential scientific research talent in the nation; and the battery of voca- 
tional placement tests developed by the United States Employment Service 
constitute evidence of the trend. 

From the standpoint of the individual school which is concerned with 
selecting students, these problems, which have been presented in a national 
setting, become problems of admission to, and placement within, the in-
stitution. Measurement has become an increasingly important factor in ad-
mission policies during the past thirty years.

Educational Placement 

Education has two over-all functions — the integrative and the differentia-
tive. Integrative education is, within the limits of individual aptitudes,
designed to make people alike in their ideals, values, loyalties, virtues, lan- 
guage, and general intellectual and social adjustment. It is frequently re- 
ferred to as "common education" or "general education." It unifies and 
gives cohesion to the social group. Differentiative education is designed to 
make people different in their competencies, to prepare them for the pro- 
fessions and specialties. The elementary school is concerned entirely with 
the integrative function; the high school, to a high degree; while at the 
college level general education tends only to supplement professional and 
special education. 

From the standpoint of the use of measurement in educational place- 
ment this distinction is important. In the common schools, which all attend, 
where general education is emphasized, it is the obligation of the school 
to adapt the curriculum to the aptitudes and abilities of the students. In 
the professional schools students are selected in terms of their ability to 
succeed in an established curriculum. 

Since World War I measurement has had an increasingly important place 
in the admission policies of colleges and professional schools and in the 
awarding of scholarships and degrees. This topic will receive detailed at-
tention in chapter 4 of this volume. The uses of measurement in adapting
the curriculum to the needs of pupils in the common schools will receive 
attention later in the present chapter. 

Guidance and Counseling 

Whereas the professional schools and colleges in their admission policies 
are concerned with measurement from the standpoint of selecting students 
who will succeed in a given curriculum, the student counseling service is 
concerned with measurement as an aid in helping the individual student 
find the vocation, college curriculum, and general social environment which 
will make for his successful adjustment. The student's intellectual apti- 
tudes, motor aptitudes, vocational interests, personality traits, study skills, 
social skills, and achievement profile are assessed with a view to assisting 
him to make an optimum vocational, educational, and social adjustment. 
This topic will receive attention in chapter 3. 

Improvement of Instruction 

School administrators have long recognized that one of the most effective 
ways of insuring that a given educational objective will be emphasized in 
the classroom is to measure periodically the extent of its realization. Like- 
wise, the teacher has recognized that students will learn most effectively 
those things that are tested. However, it is only within recent years that the 
full power of measurement to modify and improve instructional procedures 
has been realized. The test builder has been the first to recognize the 
necessity for clearly formulated educational objectives described in terms 
of changes in behavior which can be objectively observed through student 
responses to test items. The tests, in turn, have clarified and emphasized the 
refined objectives in the thinking of teachers and students. Educational 
goals have become more definite and meaningful. As a result, the selection 
of content and the organization and nature of learning experiences have 
been appraised and modified in terms of the effectiveness with which 
specific objectives are achieved. Attention has been focused on achieve- 
ment, and educational method has become a means rather than an end. 

A detailed analysis of the functions of measurement in improving the 
goals, content, organization, supervision, and administration of instruc- 
tion is presented in chapter 2. 

Improvement of the Learning Situation 

Measurement is a fundamental tool of educational research. As better 
instruments are designed and constructed, the science of education moves 
forward. What is known regarding the principles of human growth and
development, the nature and extent of individual and trait differences, the
learning process, and the dynamics of group behavior is dependent largely 
upon measurement. The facts, principles, and relationships established 
should constantly serve as basic data in the critical examination of prevail- 
ing school organization — its objectives, procedures, and basic assumptions. 

A bowing acquaintance with this fund of knowledge is certainly neces- 
sary to the practical schoolman who proposes to use measurement as a tool 
in increasing the effectiveness of a school. For example, imagine the pos- 
sible reactions of the principal and teachers of an elementary school to the 
scores on a battery of achievement tests administered to the sixth-grade 
pupils when it is found that there is a range of eight years in reading 
ability in this grade and that the typical pupil varies six years in his achieve- 
ment in the different subjects. Or imagine the principal who finds it neces- 
sary to set up a combined fourth and fifth grade. He combines the high 
achievers in the fourth grade with the low achievers in the fifth and insists 
that the teacher follow the prescribed course of study with both groups. 
At the end of the year a battery of tests is given, and the fourth-graders 
show superior achievement on every test. Such test results make sense only 
to individuals who know the elementary principles of child development 
and the basic facts regarding the extent of individual and trait differences. 

It is impossible in a brief treatment of the functions of measurement to 
emphasize adequately its place in educational research and the necessity 
of a broad knowledge of this research for the proper interpretation and 
use of measurement in the school. Some of the most essential information 
for a rudimentary interpretation and use of test results in the practical 
school situation is treated in the present chapter. It is concerned with four 
main topics — the functions of measurement in (1) establishing learning 
situations appropriate to the needs, abilities, and potentialities of the in- 
dividual student; (2) the diagnosis and alleviation of specific learning 
difficulties; (3) the motivation and directing of learning experiences; and 
(4) the development and maintenance of skills and abilities. 

Functions of Measurement in Establishing Individual 
Learning Situations 

The most important characteristic of a favorable learning situation is a 
strong ego-involved drive (purpose) on the part of the learner to acquire 
the various socially approved behavior patterns. To assume that such learn- 
ing can be made an end in itself is perhaps one of the most frequent errors 
in educational thinking. Such learning is really the means of acquiring a 
progressive realization of social status and prestige. What the learner wants 
is social position with security, favorable attention, and recognition of his
special virtues, abilities, and accomplishments. He wants success in what
he undertakes and a progressive broadening of significantly related new 
experiences. A good school or a good learning situation is one in which 
these fundamental cravings of the individual are satisfied through educa- 
tional experiences. 

The ten-year-old who gets up an hour early to practice reading a story 
he has volunteered to read to the class, the algebra student who puts in 
overtime on the really tough problems, the student who spends extra hours 
on a special report, the football squad at its grueling practice, and the boy 
who has a paper route to earn money to buy a trumpet, to take lessons for 
six months, to get to play in the band and wear a flashy uniform, are all 
examples of the social prestige factor in the learning situation. The problem 
then is to make learning highly satisfying in this sense. It should not be 
preparation for prestige-getting — it should be prestige-getting. 

Certain elements of the learning situation which are related to the 
prestige goal and which may be improved through the proper use of 
measurement are: 

1. Classification and grouping. The first consideration is that students 
should be classified in such a way that no embarrassment or stigma is 
attached. The second consideration is that the group have common 
learning needs. Group solidarity resulting from common goals, common 
understandings, common efforts, common difficulties, and common 
achievements should characterize the class. 

2. Individual instruction with attention to specific accomplishments and 
deficiencies should be possible. 

3. Objective measures of achievement and progress are important. 

4. The individual student should be given reading materials and problems 
comfortable for his level of ability. Success must be emphasized. An 
optimistic and encouraging attitude should prevail. 

5. The student should know what his specific errors, misunderstandings, 
and shortcomings are. The means of self-appraisal and identification of 
needs should be provided. 

6. The instructor should have an economical method of determining the 
aptitudes, abilities, physical condition, social adjustment, and self-adjust- 
ment of each student. 

Promotion, Grouping, and Related Problems 

With minor qualifications public education in the United States is com- 
mitted to twelve years of schooling for all the children of all the people. 
In the twelfth year as well as the first the potential unskilled laborers, truck 
drivers, and janitors sit beside the embryo research physicists, journalists, 
and surgeons. In most schools they look at the same textbooks, hear the 
same discussions, pursue the same educational goals, and are marked on
the same standards. At school age the dull and the brilliant look much
alike, especially to the elementary school teacher who receives from thirty 
to fifty new pupils each year, and the teacher in the high school who meets 
from one hundred fifty to two hundred pupils each day. 

How can educational measurement improve the effectiveness of such
schools? In answering this question, it must be understood that tests are
tools, the value of which depends upon the educational insight and in- 
genuity of the user. They are potentially as harmful when used improperly 
as they are valuable when used properly. There is no magic in mere use. 

When a new instrument becomes available in any field of endeavor, it 
is natural that its usefulness be estimated in terms of its power to facilitate 
the achievement of the then-accepted objectives, within the prevailing 
forms of organization and procedure. The fact that the results of the use 
of the instrument constantly point to the need for revising objectives and 
changing organization and procedure may go unheeded for some time. The 
influence of measurement in and on the schools has followed this pattern. 
Objective tests were first used simply as examinations had always been 
used. But in recent years it has been recognized that, if educational measure- 
ment is to achieve its maximum value in the improvement of the educa- 
tional process, the results of measurement must be considered basic data 
in a re-examination of prevailing school organization, objectives, pro- 
cedures, and basic assumptions. Primary data for such a re-examination are 
those having to do with the nature and extent of individual and trait dif- 
ferences. A brief review of these data is essential to understanding the 
problem of educational grouping. 

Range of intelligence by age and grade levels 

One of the most carefully selected samples of representative school chil- 
dren available for study is the one used by Terman and Merrill (54) in
standardizing the 1937 revision of the Stanford-Binet tests of intelligence 
and reported by McNemar (25). Seventeen communities in eleven states
were sampled. Care was taken to avoid selective factors. All American-born 
children of the white race who were within one month of a birthday were 
tested regardless of grade location. Highly reliable intelligence test scores 
were available for this sample, since both forms (L and M) of the revised 
tests were administered and average scores reported. 

Of the total sample, 2,106 subjects were in grades one to twelve and 
within the age limits six to eighteen. The median chronological age range 
in each grade was six years. Table 1 shows the mental age range (2nd to 
98th percentile) both by chronological age and grade level for this group. 
One may conclude from these and other data presented in this study that
in a typical school: (1) the first-grade teacher will find that 2 percent of
the pupils have mental ages of less than four years and that 2 percent will 
have mental ages of more than eight years; (2) the sixth-grade teacher 
will find that 2 percent of the pupils have mental ages of less than eight 
years and that 2 percent will have mental ages of more than sixteen years; 
(3) the high school teacher will find a range of from eight to ten years 
in mental age at each grade level; and (4) these conditions will be found 
to exist whether the school enforces strict policies of promotion and failure 
or promotes entirely on the basis of chronological age. 
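The 2nd-to-98th percentile ranges quoted in these conclusions are ordinary order statistics. The sketch below shows how such a range is computed from raw scores; the mental ages are invented for illustration and are not the Terman-Merrill standardization data.

```python
# Minimal sketch: computing a 2nd-to-98th percentile range of mental
# ages for one grade. The scores are synthetic, invented for
# illustration; they are not the Terman-Merrill standardization data.

def percentile(scores, p):
    """Percentile (0-100) of a list of scores, by linear interpolation."""
    xs = sorted(scores)
    k = (len(xs) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

# Hypothetical mental ages (in years) for one sixth-grade class.
mental_ages = [8.2, 9.1, 9.8, 10.4, 10.9, 11.3, 11.7, 12.0,
               12.4, 12.8, 13.3, 13.9, 14.6, 15.4, 16.1]

low = percentile(mental_ages, 2)
high = percentile(mental_ages, 98)
print(f"2nd percentile: {low:.1f} years; 98th percentile: {high:.1f} years")
print(f"range: {high - low:.1f} years")
```

With these invented scores the range works out to about seven and a half years, of the same order as the ranges reported for actual sixth grades.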

Mental Age Range by School Grade and Chronological Age*

[The body of Table 1, giving the mental age range (2nd to 98th percentile) at each chronological age and at each school grade, is not legible in this scan.]

* Adapted from Tables 7 and 8, McNemar (25). The range at the extreme grade and age levels is restricted by the nature of the sample. For example, the range of 2.8 years for the six-year age group is based on a sample of 30 six-year-olds in the first grade and one six-year-old in the second grade; for a fully representative sample the 2nd to 98th percentile range would approximate four years.

McNemar concludes that since both in 1916 and 1937 Terman and his
associates found grade levels to vary substantially as much as age levels in 
mental age, "adjustments in the provisions for learning should be made 
without too much departure from normal grade promotions" (23). 

Range of achievement by age and grade levels 

Since the beginning of educational measurement no fact has been more 
frequently revealed, and its implications more commonly ignored, than the 
great variability in the achievement of pupils in the same grade. Anyone 
who has administered a battery of achievement tests to the pupils in a 
graded school has been struck by the great overlapping between grades.
An analysis of the norms for any battery of achievement tests will reveal 
the extent of grade variability. 

The variation in achievement based on studies by Lindquist (22), Cook
(5), and Cornell (8) may be summarized briefly. It follows closely the
findings on intelligence. The range in achievement (2nd to 98th per-
centile) at the first-grade level is between three and four years; at the
fourth-grade level, between five and six years; and at the sixth-grade level, 
between seven and eight years, in the areas of reading comprehension, 
vocabulary, the mechanics of English composition, literary knowledge, 
science, geography, and history. In arithmetic reasoning and computation 
the range is somewhat less, between six and seven years at the sixth- 
grade level. These conditions are found to prevail whether the study is 
based on a large standardization sample representing many schools or on 
the median variability of single classroom groups. 

A graphic portrayal of overlapping in grade achievement of two typical 
eight-grade elementary schools, one with high rate of retardation, the other 
with low rate of retardation, is presented in Figure 1. It was developed 
by Cook (5) in a study of eighteen school systems to determine the effect 
of promotion policies on the mean achievement and variability of ele- 
mentary school classes. 

An examination of Figure 1 reveals that the chronological age distribu- 
tion for the first two grades is almost identical in the two schools, but 
beginning with grade three there is progressively more retardation in one 
of the schools. The classes are most homogeneous with respect to chrono- 
logical and mental age. The mental age range^ is not surprising, but vari- 
ability in achievement is so great as to be almost unbelievable to persons 
inexperienced in administering such tests. Least variability in achievement 
is found in arithmetic, with a range of six years at the fourth-grade level 
extending to eight years at the eighth-grade level, but in English, reading, 
science, geography, and history almost the complete range of achievement 
is found in each grade above the primary. The mean achievement for the 
school with the low retardation rate is significantly higher in most subjects 
than in the school that retains the low-ability pupils longer. The range of 
achievement with which each teacher must cope is not significantly differ- 
ent in the two schools. The most important single generalization that may 
be drawn from a study of this chart is that in the primary grades a teacher 
may expect a range of from four to five years in achievement, while above 
the primary level almost the complete range of elementary school achieve- 
ment is present in every grade. 

It should also be noted that the low achievers in any four or five suc-
cessive grade groups have more in common than they have with the mean
achiever of their own grade. Hence, if the low achievers in the eighth grade
were demoted to the fourth, they would still be low achievers in the fourth
grade. The same is true regarding high achievers in any several successive
grades. They resemble each other more in achievement than they do the
mean pupil of their grade.

^ Mental age was measured with the Kuhlmann-Anderson Intelligence Test. This test tends
to yield a narrower range of mental ability at age and grade levels than other intelligence
tests, probably because only 10 tests of a possible 39 in the entire scale are given to any
group. The test manual cautions against this practice, but not with sufficient emphasis. The
use of the median mental age on the 10 tests instead of the mean may also tend to reduce
the variability. Achievement was measured with the Unit Scales of Attainment battery.

Fig. 1. — Variability in chronological, mental, and achievement ages of pupils in two
eight-grade elementary schools, one with high rate of retardation, the other with low rate
of retardation. Frequencies computed as percent of class at each age level. (From Cook
[5, pp. 28-29].)



The policy of retardation and acceleration, which still prevails in many 
schools, is based largely on the assumption that the practice results in grade 
groups which are more homogeneous in achievement than age groups 

Fig. 1. — Continued


would be. Although there are few reports of the variability in achievement 
of unselected age groups, what evidence is available does not support this assumption.

Cornell (8) reports the variability of New York State school children 
in intelligence and achievement at three age levels — seven, ten, and four- 
teen. After comparing these age data with grade data from two previous 
studies by Coxe (9, 10), Cornell concludes that, "both the range of the
middle 50 percent and the total range show no marked differences in favor
of either age or grade groups. For practical purposes of classification, then,
we could deal with an age group without any more difficulty due to
diversity than we find in a grade" (8). This conclusion is similar to that
reached by McNemar in comparing the age and grade ranges in intelligence. 

Variability of high school and college students 
in intelligence and achievement 

The most comprehensive and detailed study of the intelligence and 
achievement of high school and college students available is reported by 
Learned and Wood (19). It involved from eight to twelve hours of test-
ing for 26,548 high school seniors, 5,834 college sophomores, and 3,826
college seniors, all in the state of Pennsylvania. 

Intelligence was measured by the Otis Self-Administering Test of Mental 
Ability. A study of the findings (19) reveals that the broad range of in-
telligence found in the lower schools is also present in high school and
college. The total range of intelligence at the college sophomore and 
senior levels was almost as great as at the high school senior level. The 
college distributions were, of course, skewed, with a much higher propor- 
tion of students at the higher levels. With reference to the college sopho- 
more distribution, the college senior median was one-half standard devia- 
tion above the sophomore median, and the high school senior median was 
approximately three-fourths of a standard deviation below it. 

The educational achievement of the high school and college students 
was measured by the General Culture battery, consisting of tests in general 
science, foreign literature, fine arts, and social studies. The overlapping 
of high school and college classes in achievement as measured by these 
tests is most concisely revealed by the following facts: More than 28 per- 
cent of the college seniors had achievement scores below the average 
college sophomore, and nearly 10 percent had scores below the average 
high school senior. The distribution of scores for the high school seniors 
revealed that 22 percent of them exceeded the average college sophomore 
and that 10 percent exceeded the college senior average. 
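The reported overlaps are consistent with the stated separations of the medians. A rough check can be made by assuming equal-variance normal distributions, an assumption the skewed college distributions only approximately satisfy:

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# College-senior median 0.5 SD above the college-sophomore median:
# fraction of seniors falling below the sophomore median.
seniors_below = normal_cdf(-0.5)

# High-school-senior median 0.75 SD below the sophomore median:
# fraction of high school seniors exceeding the sophomore median.
hs_above = 1.0 - normal_cdf(0.75)

print(f"{seniors_below:.1%} of college seniors below the sophomore median")
print(f"{hs_above:.1%} of high school seniors above the sophomore median")
```

The approximation gives about 31 percent and 23 percent, within a few points of the 28 percent and 22 percent actually observed, which is about as much agreement as skewed real distributions permit.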


In order to measure the variability in achievement of classes within the 
same institution, one representative college was studied intensively. A bat- 
tery of tests was administered to the entire undergraduate student member- 
ship. The battery consisted of the General Culture Tests already mentioned 
plus a two-hour test in English literature, vocabulary, and usage, and a 
test of two hours and twenty minutes in mathematics. It was found that if 
the graduating class that year had been selected from the entire student 
body on the basis of achievement test scores instead of from the senior 
class on the basis of credits earned, only 28 percent of the senior class 
would have graduated. The remainder of the graduates would have been 
made up of 21 percent of the juniors, 19 percent of the sophomores, and 
15 percent of the freshmen. The mean score of the graduating class 
selected on the basis of achievement would have been one standard devia- 
tion above the average of the class that actually graduated, and its mean 
age would have been two years younger. 

The range of general culture achievement with which a college in- 
structor must cope in a typical class is indicated by the fact that in 50 per- 
cent of the college sophomore classes, the range of achievement extended 
from a point below the 25th percentile of high school seniors to a point 
above the 75th percentile of college seniors. 

The Problem of Individual Differences 

Emphasis has been placed on the variability of age and grade groups in 
intelligence and achievement because, in spite of the overwhelming evi- 
dence available, educational practice, if not educational thinking, has in the 
main tended to ignore it. The basic idea persists that grade level as de- 
termined by time spent, exercises performed, and courses passed is closely 
related to intellectual skills, understandings, and usable information. 

Many of the false assumptions which inhibit teachers in meeting the 
needs of individual pupils are corollaries of the idea that grade levels 
signify rather definite stages of achievement. Despite the fact that in the 
elementary school the typical range of intelligence and achievement in a 
grade is from four to eight or more years, the teacher is commonly as- 
sumed to be a grade specialist with specific knowledge and techniques 
appropriate to a given grade and, hence, should not be expected to teach 
pupils who deviate markedly from that level. It is, furthermore, assumed 
that the course of study for a grade is the scheduled academic requirement 
to be administered uniformly to all pupils in the grade, that all pupils 
should be capable of coping successfully with the work outlined for the 
grade, that a pupil should not be promoted to a grade until he is able to
do the work of that grade, and that when individual differences are pro-
vided for, all pupils can be brought up to standard. It is still believed by
many that the strict maintenance of the passing mark results in relatively 
homogeneous classes, that satisfactory instruction for a class can be based 
on uniform textbooks, and that when relative homogeneity does not char- 
acterize a class, it is the result of poor teaching or lax standards. Since 
schools were first graded, such assumptions have thwarted teachers in the 
process of adapting instruction to the abilities of individual pupils. 

Despite the fact that the teacher is supposed to understand individual 
pupils— their aptitudes, abilities, deficiencies, interests, and peculiarities — 
it is not uncommon practice to assign an elementary teacher from thirty to 
fifty new pupils each year (twice a year where semiannual promotion is 
still practiced). If grade specialization were considered less important and
knowledge of pupils more important, the teacher would remain with the 
same pupils for several years. At the high school level, where variability in 
achievement approaches its maximum, the teacher really becomes a depart- 
mentalized specialist in subject matter. Here the teacher with the special 
knowledge, techniques, and textbook meets from one hundred and fifty to 
two hundred different pupils each day. The struggle here is no longer to 
know the abilities, aptitudes, and interests of pupils but only their names. 
In college the attempt to know names is too often given up. 

Attempts to solve the problem of individual differences in the schools 
take two general forms. The first, which has received major emphasis in 
the past, assumes that instructional groups can be made relatively homo- 
geneous with respect to general ability and then subjected to standardized 
educational methods using uniform textbooks, assignments, recitations, 
and examinations adjusted to the level of achievement. The procedures 
relied upon for forming such groups are: (1) grouping on the basis of 
one or more measures of general academic aptitude, such as intelligence 
tests, general educational achievement tests, school marks, and teacher 
opinion; (2) judicious policies of acceleration, promotion, and failure; 
and (3) effective teaching. 

The second approach assumes that heterogeneity resulting from both trait 
differences within the individual and variation between individuals is so 
great that traditional mass instructional procedures must be discarded in 
favor of techniques designed to meet individual needs. The goal here is to 
know and accept the great variability of instructional groups as it exists, 
under even the best of grouping procedures, and then to discover effective 
methods of providing for individual needs and capacities in such hetero- 
geneous groups. Measurement has an important role in the process. 


Effectiveness of general-ability grouping 

American educational procedure for the past century has been largely 
concerned with contriving methods of uniform instruction for postulated 
homogeneous groups of pupils. The period from 1800 to 1850 was one 
of intense educational activity and reform. New states, from Ohio to Iowa, 
were adopting constitutions with educational provisions; all were taking 
their educational responsibilities seriously; some were sending experts to 
Europe to study schools. Horace Mann was at work in Massachusetts, Henry 
Barnard in Connecticut. Some of the results of this activity were the graded 
eight-year elementary school, grade teachers, graded textbooks, school build- 
ings designed to house graded programs, normal schools established to 
teach the techniques of mass instruction, and the creation of the positions 
of state, county, and city superintendent of schools to direct and coordinate 
the organization. By 1870 the schools of the United States, even the one- 
room rural schools, were graded, and uniform state courses of study were 
in vogue. Achievement standards for each grade were determined subjec- 
tively by the authors of the graded texts and courses of study. Promotion 
standards were strict, and before long "laggards in our schools" became a 
national concern. 

Numerous remedies for uneven educational progress were tried between 
1875 and 1925. Some of the earlier remedies had to do with promotional 
policies. Semiannual promotion, quarterly promotion, subject promotion, 
and special promotion each had a trial. It was found that the less severe the 
effects of nonpromotion were made, the more it was practiced. Some 
remedies attempted to hold standards constant and increase the amount 
of instruction for slow pupils, as in the Batavia plan, the assisting-teacher 
plan, and the vacation-classes plan. In some, the course of study was held 
constant, and the amount of time required for slow-, medium-, and fast- 
learning pupils was differentiated, as in the North Denver plan, the Cam- 
bridge plan, and the Portland plan. In others, time-in-school was held 
constant, and the course of study differentiated for slow-, medium-, and 
fast-learning pupils, as in the Santa Barbara and Baltimore plans. Other 
plans involved dividing the course of study in each skill subject into units 
of specified activities and achievements, permitting each pupil to advance 
at his own rate in each skill, with group instruction in content areas, as 
in the Pueblo plan, the Winnetka plan, and the Dalton plan. Several of 
these plans require that pupils be grouped into slow-, medium-, and fast- 
learning groups. The ability to do this seems not to have been seriously 
questioned at the time. 

After World War I the rapid development of group intelligence and
achievement tests stimulated the practice of general-ability grouping in the
elementary and high schools. The plan was first used on a large scale in 
the schools of Detroit in 1920. In that year the 10,000 pupils entering the 
first grade were sectioned on the basis of scores on a group intelligence 
test into three groups designated as X, Y, and Z. The upper 20 percent of 
the children were designated as the X group, the middle 60 percent as the 
Y group, and the lower 20 percent as the Z group. Differentiated cur- 
riculums were provided for each group with the aim of securing the best 
possible experience for the entire range of ability represented. 
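Mechanically, the Detroit plan amounts to ranking pupils by test score and cutting at the 20th and 80th percentiles. A minimal sketch of the rule (the scores below are invented; the X-Y-Z labels and 20-60-20 split are as described above):

```python
# Sketch of the Detroit X-Y-Z sectioning rule: top 20 percent of
# intelligence-test scores -> X, middle 60 percent -> Y, bottom
# 20 percent -> Z. Scores are invented for illustration.

def section_xyz(scores):
    """Return a parallel list of 'X', 'Y', or 'Z' labels by score rank."""
    n = len(scores)
    # Indices sorted from highest score to lowest.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    labels = [None] * n
    for rank, i in enumerate(ranked):
        if rank < 0.20 * n:
            labels[i] = "X"
        elif rank < 0.80 * n:
            labels[i] = "Y"
        else:
            labels[i] = "Z"
    return labels

scores = [112, 95, 88, 130, 101, 76, 119, 104, 99, 83]
print(list(zip(scores, section_xyz(scores))))
```

With ten pupils the rule places exactly two in X and two in Z; ties at a cut point fall arbitrarily, one of the practical objections raised against such sectioning.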

The idea of grouping pupils on the basis of some measure of general 
ability was extensively adopted. The degree to which curriculums were 
differentiated to meet the needs of the hypothetical slow, average, and 
bright pupil varied from system to system. Bases for grouping also differed, 
as did the proportion of pupils placed in the different groups. In 1936 
the National Society for the Study of Education devoted Part I of its 
Thirty-Fifth Yearbook to a critical evaluation of practices in the grouping 
of pupils. General-ability grouping was criticized and defended on educa- 
tional, philosophical, social, and psychological grounds. Experimental 
studies were summarized and the conclusion reached that the evidence 
slightly favored ability grouping where adaptations of standards, materials, 
and methods were made. The pupils in dull groups seemed to profit 
slightly, with pupils in bright groups frequently showing lower achieve- 
ment than when placed in more heterogeneous groups. However, little 
attention was given to the question of the extent to which general-ability 
grouping reduced the variability of instructional groups in specific subject
areas.

General-ability grouping is based on the hypothesis that there is rela- 
tively little variation from trait to trait within the individual, that all traits
with which the school is concerned are substantially correlated, and that 
mental functions are organized around a predominating general factor 
which determines the general-competence level of the individual. Evidence 
from several overlapping fields of investigation tends to refute this hypoth- 
esis. One is concerned with basic theories of mental organization, a 
second, with studies of so-called idiots savants, a third, with asymmetry 
of development in "normal" and "gifted" individuals, a fourth, with direct 
measures of trait variability, and a fifth, with evidence of correlations be- 
tween traits. This research has been well summarized by Anastasi and 
Foley (2). General conclusions for our purposes may be based on Hull's 
(15) study of the variability in amount of different traits possessed by the
individual. His subjects were 107 ninth-grade boys to whom he adminis-
tered thirty-five standardized psychological and educational tests involving
a wide variety of functions. Hull concludes that trait differences in the 
typical individual in this group were 80 percent as great as individual 
differences in the total group, that trait differences tend to conform to the 
normal curve, are more than twice as great in some individuals as in others, 
and that no relationship exists between the individual's general level of 
ability and the extent of his trait variability. 

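The kind of comparison behind Hull's figure can be outlined as follows: every test is reduced to standard-score form, and the spread of a single pupil's scores across traits is then set against the spread of all pupils on a single trait, which is 1.0 by construction after standardizing. The sketch below is a minimal illustration with fabricated scores, not a reconstruction of Hull's computations.

```python
# A rough sketch of the comparison underlying Hull's 80-percent figure:
# express every test in standard-score form, then compare the spread of
# one pupil's scores across traits with the between-pupil spread on a
# single trait (1.0 by construction). Fabricated scores, not Hull's data.
import statistics

def trait_variability_ratio(score_matrix):
    """score_matrix[p][t] holds pupil p's raw score on test t."""
    n_tests = len(score_matrix[0])
    columns = [[row[t] for row in score_matrix] for t in range(n_tests)]
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.pstdev(col) for col in columns]
    standardized = [
        [(row[t] - means[t]) / sds[t] for t in range(n_tests)]
        for row in score_matrix
    ]
    # Median spread of a pupil's standardized scores across traits,
    # relative to the between-pupil spread of 1.0 on each trait.
    return statistics.median(statistics.pstdev(p) for p in standardized)

scores = [
    [60, 72, 55],
    [58, 61, 70],
    [75, 68, 66],
    [64, 59, 73],
]
print(round(trait_variability_ratio(scores), 2))
```

When traits are perfectly correlated the ratio is zero; the more independent the traits, the closer the within-pupil spread comes to the between-pupil spread.
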
Studies of the extent to which the variability of elementary school classes 
can be reduced through general-ability grouping have been made by Hol- 
lingshead (14) and Burr (4). Hollingshead was primarily concerned with 
determining the best measure of general ability for classification purposes 
and Burr with the extent to which general-ability grouping reduces vari- 
ability in reading and arithmetic. Educational achievement test batteries 
were found to be the most effective basis for grouping. The variability of 
the typical X, Y, and Z section in reading and arithmetic was found to be 
approximately 80 percent of the total grade range. Individual pupils were 
found to be such complexes of more or less independent abilities that 
when sections were made nonoverlapping in one phase of a subject such 
as arithmetic reasoning, they overlapped greatly in another phase such as 
arithmetic computation. 

The important generalization to be drawn from studies of trait variability 
is that instructional groups formed by general-ability grouping are not 
sufficiently homogeneous to warrant uniform mass instructional procedures. 
For example, a typical sixth-grade class has a range of eight years in read- 
ing ability. After X, Y, Z sectioning on the basis of educational age, each 
section will still show a range of from five to seven years. Regardless of 
grouping procedure the teacher's attention must constantly be directed to 
individual children and their immediate problems in learning. The obliga- 
tion of the school to furnish instructional materials with a range of diffi- 
culty commensurate with the range of ability in a class and to individualize 
instruction is as great when general-ability grouping is practiced as when 
it is not. 

Effectiveness of judicious policies of promotion 

Teachers and administrators at the high school and college levels fre- 
quently attribute the wide range of ability which they find in classes to lax 
promotion practices in the lower schools. They maintain that the practice 
of universal promotion which has crept into the lower school to hide its 
failures and keep the community happy is responsible for the wide range 
of individual differences in the upper school. What is needed, they argue,
is a set of meaningful grade standards which must be achieved before a pupil
is promoted. This, they maintain, would be an honest policy and would require
pupils to earn their promotions. Grade levels would mean something and 
really signify definite stages of educational development. More rigorous 
policies would raise the standards of our schools. Pupils would be stimu- 
lated to greater effort. According to this line of argument, pupils who are 
promoted regardless of achievement suffer emotionally from a progressive 
inadequacy to deal successfully with the school situation. They become 
discouraged, quit trying, and learn that they can "get by" without effort. 
If individual differences are provided for, they conclude, almost all pupils 
can attain a respectable standard. We must provide for those differences 
and get all pupils up to that standard. An investigation reported by Cook 
(7) tests the validity of several of these claims. Complete test records were 
available for 148 Minnesota school systems. These systems were first 
ranked on the basis of the amount of retardation. Then nine systems that 
approached the universal promotion end of the scale were matched with 
nine systems which maintained rigorous standards of promotion. Matching 
was on the basis of size of school, socioeconomic status of parents, and 
professional qualifications of teachers. 

It was found that schools which have relatively high standards of pro- 
motion (retard the dull and accelerate the bright) tend to have a higher 
proportion of over-age, slow-learning pupils, since such pupils remain in 
school from one to several years longer. The higher proportion of such 
pupils reduced significantly the mean mental age and achievement level 
of grade groups in these schools. 

In comparing the variability of the two groups of schools on eleven 
achievement tests and intelligence at the seventh-grade level, no significant 
difference was found. The higher proportion of low-ability pupils in the 
schools with high rates of retardation tended to keep the variability of 
classes large. Pupils were rarely failed more than twice in the same grade 
and eventually reached the upper grades in spite of the efforts to maintain
standards.

When the achievement of pupils of the same age and intelligence in the 
two groups of schools was compared, there was no reliable difference. 
This would tend to indicate that the schools were well matched and also 
that the constant threat of failure did not increase achievement in the 
schools attempting to maintain high standards. It also emphasizes the fact 
that the retention of a large number of low-ability pupils through non- 
promotion tends actually to reduce grade standards and aggravate the 
range of ability problem. 


As long as all the children of all the people remain in school, it will be 
impossible to reduce the variability of instructional groups significantly 
through promotion policies. If rigorous policies of promotion are adhered 
to, the efficiency of the school is reduced through the accumulation of low- 
ability pupils and the lessening of educational opportunities for the more
able.

Effective teaching procedures influence individual differences

The idea that instructional groups can be made more homogeneous in a 
given achievement area through effective teaching is quite common. It
keeps company with other ideas inherent in the traditional conception of 
the schooling process. According to this point of view the course of study 
outlines what is to be taught in each subject each year, what textbooks are 
to be used, and what pages are to be covered by a given date. Education 
consists of learning such things as are found in courses of study and text- 
books: spelling words, type problems in arithmetic, names, dates, causes 
and results of wars, states and capitals, explorers and where they explored, 
cities and their characteristics, countries and their products, rules for 
punctuation and capitalization, and the seven basic food groups. 

Conscientious teachers with this point of view who have taught a given 
grade for a number of years know this material well, and they know how 
to threaten, coax, drill, drive, review, and test pupils until they learn it. 
There is a passing mark, perhaps 70 or 75, and this is interpreted by the 
teacher as meaning that this percentage of what is taught must be learned 
in order to pass to the next grade. The aim is to get all pupils over the 
passing mark, to make them homogeneous with respect to the goals, to 
provide for individual differences in the class, and to get them all up to 
standard. The conscientious teacher does these things. Why, then, should 
the range of achievement be what it is in these classes? What is wrong with 
these goals? Is there harm in striving for uniformity of achievement? Can
homogeneous instructional groups be achieved in this way? 

The problem with which we are dealing is basically concerned with the 
effect of a period of learning upon individual differences. Are individuals 
more alike or less alike with respect to a given ability after a period of 
instruction? Does good teaching increase or decrease the variability of a 
class? A considerable amount of research on this question is available and 
has been summarized by Anastasi (1), Peterson and Barlow (30), and
Reed (31). The research indications are somewhat contradictory, and
many technical problems are involved in their interpretation. But for our 
purposes the following generalization seems warranted: if the responses
to be learned are sufficiently simple and the goals that have been set so
limited that a high proportion of the group can master them during the 
period of learning, the variability of the group becomes less; if the task is 
complex and the goals unlimited, so that the abilities of the most apt 
members of the group are taxed during the period of learning, the vari- 
ability of the group increases. 

The point of view that instructional groups can be made more homo-
geneous in a given achievement area through effective teaching presup- 
poses limited educational objectives — limited not only with respect to the 
complexity of the learning but also with respect to the amount of even 
the simple learnings. In order to place requirements within the reach of 
the less able student, emphasis is commonly placed on rote factual learn- 
ing and type problem solving. Textbooks in the content areas are frequently 
written down to a difficulty level of two or three years below the grade at 
which they are to be used. Materials are selected for inclusion in each 
grade curriculum which can be mastered by at least 80 percent of the 
pupils. Questions and answers are drilled upon before examinations with 
the understanding that the final examination questions will be taken from 
the list. Frequently the teacher's time is devoted almost exclusively to the 
slow learners. In such schools when pupil progress is measured over a 
period of time in terms of the limited goals, pupils with the lowest initial 
scores make the greatest progress. There tends to be a negative relation- 
ship between initial scores and gain. The relationship between gain and 
intelligence is low and frequently negative. These tendencies indicate that 
pupils of high and even average ability are understimulated and essentially
neglected.

The most serious result of emphasizing limited goals in education is 
that what is learned is frequently of little value and is retained for but a 
short period. Only rote memory, a low-order mental process, is required 
to pass the name, describe, define, who, what, when, and where type of
examination with which achievement is frequently measured. Tests of 
retention given from three months to three years after a course is completed 
reveal that from 40 to 80 percent of the information required by the final 
examination is lost. Buckingham (3) has called this the greatest waste in 
education. Investigations in algebra, botany, chemistry, physiology, psy- 
chology, and zoology all reveal the same rapid rate of forgetting. The 
forgetting curves closely approximate those for nonsense materials, indicat- 
ing that much of what is learned for examination purposes is no better 
organized and no more useful than nonsense materials. 

The relative permanency of different types of learning has been investi-
gated by Tyler (38) and Wert (39). In Tyler's study a test in zoology
measuring five objectives was administered to eighty-two students at the
beginning of the course, at the end of the course, and fifteen months later. 
Percent of loss or gain on each part of the test during the fifteen-month 
period was computed in relation to the amount gained during the course. 
On the part of the test requiring: (1) names of organs identified from pic-
tures, the loss was 22 percent; (2) recognition of technical terms, the loss
was 72 percent; (3) recall of facts, the loss was 80 percent; (4) applica-
tion of principles, there was no loss or gain; (5) interpretation of new ex-
periments, there was a gain of 126 percent. Wert's experiment, also in 
zoology, is quite similar. This experiment measured percent of loss or gain 
over a period of three years in relation to the amount gained during the 
course. A gain of 60 percent was found in application of principles to new 
situations, and a gain of 20 percent in interpretation of new experiments. 
There was a loss of over 50 percent in terminology, function of structures, 
and main ideas; and a loss of over 80 percent in associating names with
structures.

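The loss-or-gain figures in both studies rest on a simple ratio: the change over the retention interval divided by the amount gained during the course. A minimal sketch, with illustrative scores:

```python
# Percent of loss or gain over a retention interval, expressed relative
# to the amount gained during the course, as in the computation Tyler
# describes. Negative values are losses, positive values gains. The
# scores are illustrative.

def retention_percent(begin, end, later):
    """100 * (change after the course) / (gain during the course)."""
    return 100.0 * (later - end) / (end - begin)

# A part of the test on which the class gained 25 points during the
# course (20 to 45) and stood at 25 fifteen months later:
print(retention_percent(20, 45, 25))  # -80.0, i.e. a loss of 80 percent
```

Gains after instruction ends, as on the interpretation items, simply come out positive on the same scale.
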
These experiments indicate that learnings involving problem-solving
relationships and the operation of the higher mental processes are relatively
permanent and that unrelated facts and mere information are relatively 
temporary. Unless learning involves differentiation and integration of 
old and new responses into a problem-solving type of mental process or 
into an organized behavior pattern, it has little permanence or value. How 
was it learned? is the important question. 

The permanent results of the educational process are measured by such 
tests as: (1) vocabulary, (2) reading comprehension in the natural sci-
ences, social sciences, and literature, (3) problem-solving in mathematics 
and the sciences, (4) ability to use the library and basic reference materials, 
and (5) ability to write and speak effectively. It is with reference to such 
objectives that great heterogeneity in achievement exists, and the more 
effective the instruction, the greater the heterogeneity. 

It would seem then that the emphasis which some schools place on 
striving for homogeneity in classes, getting students over the passing mark, 
and providing for individual needs with a view to bringing all pupils up 
to a standard, encourages teachers to set limited goals for instruction which 
result in temporary factual learning involving mainly the lower mental 
processes. When the ultimate goals of education, involving the higher 
mental processes and permanent learnings, are striven for and each student
is stimulated to capacity effort, the variability of instructional groups in-
creases.



Measurement as an Aid in Meeting Individual Needs 
in Heterogeneous Groups

To direct the educational experiences of each child in such a way as to 
bring about his optimum development and adjustment to his culture 
presents one of the most complex problems imaginable. Materials presented 
up to this point were selected with a view to revealing the extent and 
nature of individual and trait differences and the general effect of educa- 
tional experiences upon them. Since many of the educational devices and 
procedures directed to this end have in the past sprung from misconcep- 
tions regarding the facts and have oversimplified the problem, this ap- 
proach seemed necessary. Recommendations for the uses of measurement in 
solving the problem must begin with these facts and be limited by them. 
The approach must be experimental and tentative to a high degree, since 
interpretations and implications, even when drawn from facts, are subject 
to error. 

In general social intercourse during the years of maturation, the traits 
most important to the child in determining membership and status in a 
congenial group are chronological age and general physical development. 
Together with general social development these are also the most obvious. 
In a graded school it is tremendously important to a child that he be 
grouped with his peers. To deny him this privilege is to violate one of the 
most important requirements of a favorable learning situation. Therefore, 
throughout the period of maturation, which corresponds roughly with the 
compulsory school age, these traits should constitute the fundamental basis 
for educational grouping, that is, when a child is five he enters kinder- 
garten, when six he enters the elementary school, when twelve he enters 
the junior high, and when fifteen the senior high school. Since chronological 
age is not perfectly related to physical and social development, some 
adjustments have to be made, especially at the primary level with develop- 
mental level taking precedence over chronological age. However, the all- 
important factors in the basic grouping of pupils are physical and social 
development. A child should live and work in the group he most obviously 
belongs with, which accepts him and which he accepts. It will be recalled 
that both intelligence and achievement test data show age groups to be no
more variable than grade groups. Hence this grouping will not materially
increase the range of ability with which the teacher must cope.

The secondary basis for grouping should be achievement in specific 
areas, that is, for reading instruction children should be grouped in terms 
of reading ability and needs, for English instruction in terms of needs in 
English. It should be remembered, however, that even when grouped in
this manner attention must be focused on the individual pupil, since two
pupils making exactly the same score on a test will have different abilities
and needs. 

If physical and social development is accepted as the primary basis for 
grouping in the common school, all assumptions that a grade level indi- 
cates anything specific regarding intellectual competence or educational 
achievement must be given up. Evidence has been presented indicating that 
grade level, per se, never truly signifies these things. The assumption that 
it does has led to absurd practices, subterfuge, and confusion in meeting 
individual needs. The determination of intellectual competence and educa- 
tional achievement must rest primarily upon measurement. Teacher ob- 
servation and judgment will always be of extreme importance in the 
learning situation and in the appraisal of those traits not yet measurable. 
But measurement will carry the burden of information on educational
achievement.

Diplomas in the common schools will continue to be given by virtue of 
years attended, age attained, and courses taken, but they will be assumed to 
convey little or no further meaning. The common school period probably 
should be the same length for all pupils. No attempt should be made to 
bring the slow pupil up to standard by keeping him in school a few extra 
years. No attempt should be made to keep the bright pupil down to stand- 
ard by accelerating him through school at an early age. If instruction is to be 
adapted to individual needs and capacities, the diploma should be con- 
sidered little more than a certificate of attendance, and the level of 
achievement attained in the various areas should be determined through 
measurement procedures. 

To set respectable standards of educational achievement for graduation in 
schools attended by all the children of all the people means automatic 
labeling as failures, with moral and social approval implications, of a large 
percentage of pupils. It also means that pupils below this level in compe- 
tence will, with the encouragement of teachers and parents, attempt learn- 
ings that are beyond their capacity. The status of pupils well above the 
standards will not be adequately represented by the diploma, nor will 
their capacities be tested by the curriculum. Measurement makes possible 
the elimination of a set standard for all pupils who finish the common 
schools. The expected achievement of each pupil can be related directly to 
his capacity and to the requirements of his probable adult vocational posi- 
tion in each of the learning areas. His educational status and growth in 
these areas can be more accurately determined and communicated through
measurement.


The general limitations of diplomas, and to some extent of school marks
and percentile rank in class, in accurately representing academic capacity
and status for transfer and certification purposes are well known. There are
several reasons for this. Teachers differ in grading standards, classes differ 
in quality of students, and schools differ in both grading standards and 
academic capacity of students. In state-wide high school testing programs 
it is frequently found that schools differ so much in achievement in the 
various subjects that the distributions of scores for the extreme schools 
are nonoverlapping. In ninth-grade algebra, for example, the lowest-scor- 
ing pupil in the highest-scoring school is above the highest-scoring pupil 
in the lowest-scoring school. The pupil who receives an F in the highest- 
scoring school would receive an A if enrolled in the lowest-scoring school. 

Wood (40) has proposed a dual system of marks to meet the two related, 
but independent, purposes for which grades are given. He recommends 
that for certification and transfer purposes, comparable educational meas- 
ures be based on the best standardized educational tests. And for the 
moral, social, and disciplinary welfare of the individual pupil he recom- 
mends that the local school marks be retained. 

If local school marks are based on relative standing in a class with a wide 
range of capacity, their moral, social, and disciplinary value is questionable. 
If, however, marks are based on achievement in relation to well-established 
capacity, or on the achievement of objectives peculiar to the school, or 
those not yet measurable in an objective sense, they have an important
function.

With reference to valid educational goals for which standard scales of 
development are available, there would seem to be little reason for indicat- 
ing progress in other than objective terms. The practice in some schools 
of converting standardized achievement test scores into the more or less 
ambiguous school mark is indefensible. For example, if a pupil has made 
a gain in reading age of from ten years in September to ten years and nine 
months in January, this fact should be known to all persons concerned, 
including the pupil and his parents. To convert such a fact into a school 
mark is absurd. (Development in the measurable educational achievement 
areas should be considered as objectively as is growth in height and 
weight. Certainly the suggestion that a pupil's growth in height be com- 
pared with that of others in the same class and a grade assigned would be 
considered absurd.) A gain of three months by another pupil during this 
same period may be properly considered a greater achievement relative 
to capacity. Individual development becomes the all-important considera- 
tion. When a testing program makes such measures of achievement sys-
tematically and periodically available, traditional school marks lose much
of their significance. Also the types of educational experiences which 
produce the greatest development are more quickly and accurately deter-
mined.

The use of objective measurement fosters the child-development point 
of view and makes obvious the necessity of adjustment of instruction to 
individual capacity. Such a point of view tends not to develop in a school 
which assesses educational progress in terms of subjective marks based on 
relative standing in a class. 

The emphasis placed on the value of measurement in guiding educa- 
tional development in the schools attended by all children leaves no 
place for the use of measurement as a flunking device or as a basis for 
exclusion from school except at levels so low that institutional commitment 
is indicated. In professional schools and colleges measurement has a 
definite and justifiable function as an exclusion and flunking device. Indi- 
viduals without the necessary aptitudes and abilities who aspire to teach 
school, or practice law, engineering, dentistry, medicine, and so forth, 
must be eliminated in the interest of the public welfare and economy. 

Suggested administrative policies 

The traditional graded elementary school and the departmentalized 
high school, with their emphasis on required curriculums, uniform courses
of study, uniform textbooks, heavy pupil-teacher load, and the like, were
designed without regard for individual and trait differences. The purpose 
was to give a "shotgun" type of education to as many pupils as possible 
with the least expenditure of money. Educational policy was dictated 
largely by available funds, not by pupil needs. Even in such schools meas- 
urement has increased educational efficiency by making the teacher aware of 
individual aptitudes, abilities, and specific needs with an economy of 
time and effort. But assuming that the teacher has the "developmental" 
rather than the "subject-matter-to-be-covered" point of view and knows 
how to stimulate pupils to maximum educational effort, she still lacks 
classroom space, time, equipment, books, supplies, freedom, encourage- 
ment, and energy to do the job. Many highly competent and conscientious 
teachers leave the profession yearly because of the resulting frustration. 
The relatively simple problem of providing a classroom with reading 
material which has a difficulty range and interest appeal commensurate 
with the abilities and interests of the class is rarely attacked seriously. 

The administrative policies designed to enable teachers to meet the 
needs of individual pupils in heterogeneous groups have two purposes:
(1) to make it possible for the teacher to know the pupil better — to know
his abilities, his interests, and his deficiencies well enough to direct his 
learning; and (2) to provide instructional materials with a range of 
difficulty and content commensurate with the range of abilities and interests
of the instructional group. 

1. The testing program of the school should be systematic and com- 
prehensive. It should furnish the teacher with up-to-date information re- 
garding the growth record and status of each of his pupils in at least the 
fields of English, reading comprehension, mathematics, study skills, and 
problem-solving in the natural and social sciences. The tests should measure 
at regular intervals the permanent learnings which have been achieved 
toward the major and ultimate objectives of education. Knowledge of the 
pupil and his record of achievement should be considered basic data in the 
educational process. 

Measurement should begin with a relatively undifferentiated test of the 
individual Binet type at the preprimary level and reach a considerable 
degree of differentiation in terms of improvable skills, abilities, and atti-
tudes by the junior high school level. During the primary school years the 
differentiation should be in terms of specific abilities related to developing 
number concepts, reading readiness, some aspects of beginning reading 
achievement, and specific behavior related to health and socialization. 
Group testing which assumes reading ability is generally unsatisfactory 
below the fourth grade. During the intermediate school years, however, 
the approach to complete differentiation in terms of ultimate educational 
objectives can be made. The selection of tests should emphasize the more 
permanent learnings and skills and discourage mere factual learning. 

The results of the testing should always be portrayed in graphic profile 
form, showing the growth in each differentiated ability from year to year. 
This test record should be in the hands of the teacher in the permanent 
record folder, which should contain information on other important 
aspects of the child's development, the health record, and information on 
social and emotional development. It should also include pertinent informa- 
tion about the child's family, his interests, and activities outside of school, 
as well as samples of the best art, handwriting, poetry, and composition 
produced each year. 

Systematic testing of the type described above should preferably be done 
at the beginning of the school year. This time of testing has several distinct 
advantages. It provides the teacher with up-to-date information regarding 
the educational status of each child at the beginning of the period of in- 
struction when attention should be focused on planning in terms of indi-
vidual needs. The temptation to use the tests to determine promotion or
failure is minimized. When a child knows his status in improvable skills 
at the beginning of a year and considers his progress during the preceding 
year, there is usually an urge to better his previous record. When tests are 
given at the close of the school year the efforts of the teacher are more 
likely to be centered on preparing children for the tests and result in 
undesirable "cramming" procedures. 

Other tests of a more diagnostic nature in all the basic learning areas 
should be available to teachers at all times in order to determine individual 
needs and measure progress in the skills and abilities being emphasized in 
instruction. These tests should always be selected and used strictly from 
the standpoint of their instructional value. 

2. Grouping within the class on the basis of status in specific learning 
areas is one of the most essential procedures in meeting individual needs. 
In the elementary school from three to five groups within each class are 
common, with the pupils grouped differently in each subject or ability area. 
The within-class grouping in reading, for example, may be based on ability 
in general reading comprehension or on a more detailed analysis of indi- 
vidual needs, or both. Some of the groupings may be continued over a 
relatively long period; others may be for only a day or two or until the 
need for them no longer exists. Pupils may be transferred in or out of a 
group at any time. Grouping should be practiced not only in terms of needs 
in skill areas but also in terms of interests in a topic, or in terms of per- 
sonality and social needs. The latter groups commonly take the form of 
committees selected to carry through a special project. Individual assign- 
ments of responsibility have the same purposes. Much of the work of the 
class, however, will still involve the entire group. This is necessary in 
planning, coordinating, and unifying the over-all activities and goals of the 
class. The unification of a class is achieved largely through the unit activi- 
ties. The pupil needs to feel most strongly his membership in, and accept- 
ance by, the total group, the other groupings being for special needs and 
purposes which he considers subsidiary to the over-all purposes of the total
group.

3. The teaching load must be reduced. Teachers skilled in the art of 
adapting group procedures to the needs of individual pupils insist that no 
primary teacher should be responsible for more than twenty-five pupils and 
that thirty pupils should be the maximum load of intermediate grade 
teachers. At the high school and college levels the teacher-student load 
determines how much individual attention is possible. 

4. Many elementary schools permit the teacher to remain with the same 
pupils for more than one year. In some schools the pupils have the same 
teacher from kindergarten through the sixth year; in others, one teacher 
through the primary years, and another through the intermediate years. 
The continuing-teacher plan eliminates the concept of a grade teacher and 
places emphasis on knowing the pupil and his needs, knowing parents, and 
thinking in terms of child development. The teacher is enabled to start 
each year with a thorough knowledge of his pupils and can plan the work 
of the year in terms of specific needs. The process of promotion is eliminated. 

At the junior and senior high school levels where the typical teacher 
meets 150 pupils each day, the problem of making it possible for teachers 
to know their pupils better can be alleviated to some extent by having 
home-room teachers teach their home-room groups for at least two periods, 
preferably in the core subjects — social studies and English. These two sub- 
jects should be integrated, utilizing the social studies and literature as 
content and stimulation material in learning to read, write, speak, and listen. 

5. The plan of reporting to parents percentage marks, or letter grades, 
is not consistent with the policy of meeting the needs of individual pupils. 
Likewise, the practice of simply marking a pupil satisfactory or unsatisfac- 
tory in terms of his general learning capacity is inadequate. The weakness 
of these marking and reporting methods is not that they tell too much about 
the pupil, but rather that they tell too little and their meaning is ambiguous. 

It is more beneficial and revealing to all concerned if the teacher once or 
twice a year, oftener when desirable, sits down with the parents and dis- 
cusses the child's achievements and needs, going through his record folder 
and noting measures of growth and status, his achievement in the basic 
skills and toward the major educational objectives, discussing samples of 
his work, his behavior characteristics, and his personality needs. 

6. A wealth of instructional material must be provided. Instructional 
materials must have a range of difficulty, interest appeal, and content com- 
mensurate with the range of abilities and interests of the class. In the ele- 
mentary school, classroom libraries must be provided which contain the basic 
reference materials, books at appropriate levels of difficulty on the units 
to be developed by the class, children's literature in abundance, and books 
on popular mechanics, hobbies, and how to make things. The school library 
should constantly feed into and supplement the room libraries, but it should 
not supplant them. 

In addition to a wealth of books, there must be in the elementary class- 
room children's magazines and newspapers, art materials, tools, and work- 
benches, simple science laboratory equipment, a science cabinet and sink, 
large bulletin boards, as well as blackboards, pictures, visual aids of all 
types, movable desks, tables and chairs, heavy and light screens for dividing 
the room, and a combination radio-phonograph. In the high school and 
college the equipment of rooms will be more specialized, but the pattern 
of adequacy suggested for the elementary school should be followed. 

Adjustment of curriculum policies and practices 

It is obvious that if in the elementary school all pupils are required to 
read the same books, do the same exercises, solve the same problems, and 
attain the same minimum standards on examinations, there can be but slight 
recognition of individual interests and abilities. Likewise, in the high school 
and college if curriculum requirements are rigid, if students are required 
to take certain subjects, or a sequence of courses in a given curriculum, 
recognition of individual needs is thwarted. This is true, whether the re- 
quirements are imposed by the state, the school-standardizing agencies, or 
the local school administration. 

Whenever a school purports to accept all the children of all the people, 
it must strive for a curriculum sufficiently flexible and broadened to recog- 
nize and reward the great variety of combinations of aptitudes and interests 
of its pupils, enabling them to know their peculiar strengths and weak- 
nesses and preparing them to fit into our complex society with its multi- 
plicity of demands. The curriculum must be modified to provide flexibility 
of requirements, and the teachers and counselors must be free to plan for 
the welfare and optimum development of individual pupils. 

1. In general education the curriculum content should be organized for 
the teachers' guidance around significant aspects of the social and natural 
environment for the purpose of developing understanding and intelligent 
behavior with respect to them. It must be recognized that the understanding 
of the immediate cultural and physical environment requires an understand- 
ing of other cultures in other environments as determined by evolutionary, 
geographical, and biological factors. 

As far as pupils and their learning are concerned, the foregoing should 
be the result, not the starting point. For pupils, the curriculum is organized 
around their purposes and understandings. The teacher must guide the 
pupils' purposes toward understanding the social and natural environment. 

The curriculum content in the social and natural sciences should be or- 
ganized around purposeful problems with a view to: 

a) Making possible the use of a wide variety of stimulating educational 
material from literature, factual source books, visual and auditory 
aids, with the local social and physical environment as a laboratory. 

b) Making possible an appeal to many and varied interests. 

c) Making possible the utilization of reading material with a wide range 
of difficulty, content, and interest appeal. 

d) Stimulating and making possible a wide range of activities in read- 
ing, research, problem-solving, discussion, use of reference materials, 
writing and giving reports, letter-writing, organizing materials, plan- 
ning, observing relationships, drawing conclusions, formulating gen- 
eralizations, dramatization, understanding and using all forms of art, 
construction activities, using mathematics in a functional way, taking 
responsibility and cooperating in group projects and activities, all to 
the purpose of developing skills, understandings, ideals, values, be- 
liefs, and attitudes. 

e) Giving purpose to, and affording a use for, the basic skills in the 
language arts, reading, and mathematics. Skills in these areas must 
be given constant, definite, and systematic attention, but much prac- 
tice in their use should come in the social and natural sciences. 

2. The grade levels at which certain knowledge, skills, and abilities 
should be learned cannot be determined with any degree of specificity. 
It is common practice, for example, in the mechanics of capitalization, to 
specify that children should learn in the third grade to capitalize "Mr.," 
"Mrs.," and "Miss"; in the fourth grade, the names of cities, states, streets, 
and organizations; in the fifth grade, the names of persons and firms and 
the first word in a line of poetry; in the sixth grade, the names of the Deity, 
the Bible, and abbreviations of titles and proper names. Experienced teach- 
ers know that some pupils will learn all these by the close of the third grade; 
others will not have learned them by the time they graduate from high 
school. The teacher must be prepared to lead each child through the next 
steps in his development, regardless of the level he has achieved. 

Similar graded lists of things-to-be-learned are provided in traditional in- 
structional materials in all subjects: lists of words in spelling, lists of 
processes in arithmetic, lists of exercises in handwriting, lists of capitaliza- 
tion, punctuation, and usage rules in language, lists of skills to be de- 
veloped in reading, and so forth. Properly used, these lists of essential 
learnings have value. Two values will be given attention. First, they may 
be considered as learnings to be introduced and emphasized when their 
purpose is clear. (The purpose of the most essential basic skills is clear even 
for the first-grade child.) They should never be considered as things to be 
learned in a one, two, three fashion, once and for all time, out of their 
functional setting and natural content, or as centers around which all in- 
struction should be organized. The practice of organizing the curriculum 
almost entirely around these piecemeal, itemized goals has been the greatest 
limitation of the traditional school. If we learned to play golf this way, we 
should spend all our time with the instructor, drilling on itemized elements 
of the game, but never playing a game of golf. The purposes of drill in 
golf emerge from the game. 

The second use to be made of such lists is in the systematic and econom- 
ical diagnosis of individual deficiencies. At regular intervals the pupils 
should be tested for knowledge of essential spellings, essential processes in 
arithmetic, essential handwriting elements, essential English skills, and 
essential reading skills, the purpose being to keep both the pupil and the 
teacher constantly aware of individual deficiencies and developments. Such 
tests can be of the informal, teacher-made type, covering the spelling list 
for the grade at the rate of twenty words per week, mixed fundamental 
problems in arithmetic covering every process learned up to the time of the 
test, dictation exercises in English covering various aspects of the mechanics 
of writing, or they may be the more formal standardized materials such as 
tests of the various reading skills and diagnostic charts for handwriting. 

3. Since life outside of school recognizes and rewards a great variety 
of aptitudes and combinations of aptitudes, the school should do the same. 
The traditional school has too often recognized and rewarded only docility 
and a facile memory. Teachers in such schools have been surprised when 
pupils they had considered hopeless achieved success in later life. The 
broadening of the elementary, high school, and college curriculums to 
include various forms of practical arts, fine arts, a school paper, athletics, 
extended educational field trips, participation in community affairs, stimu- 
lation of hobbies, participation in school government, the safety patrol, 
radio programs, work experience, and community health programs is evi- 
dence of the acceptance of this principle. The common school should be a 
proving ground in which the individual discovers his peculiar strengths and 
weaknesses. If every child is to find himself, the school's opportunities for 
development must be as broad as the demands of the culture, and the 
requirements must be within the limits set by the possibility of reality, 
purpose, and meaning for the individual child. 

Functions of Measurement in the Diagnosis and Treatment 
of Learning Difficulties 

Effective learning results in complex behavior patterns which may be 
differentiated into more or less related hierarchies of habits, skills, under- 
standings, feelings, and desires. The product is infinitely complex, the 
process of building it linear, and to a considerable degree sequential. The 
role of the school is to specify within the limits of individual capacities the 
behavior of educated people and then to determine the most effective 
sequences of experiences to bring it about. If sequence were not important, 
there would be little need for schools, since living in the complex culture 
would develop the necessary complex learnings. To make school experi- 
ences too lifelike would destroy the sequential organization of experiences 
essential to efficient learning. The relative importance of sequence and of 
the various criteria for determining it in the different areas of learning are 
points of issue between schools of educational thought. 

In general, the criteria for determining optimum sequence of experiences 
are of two kinds: (1) those related to the physical, mental, and emotional 
maturation of the child, and (2) those related to the nature and complexity 
of the behavior to be learned. These are really different aspects of the same 
developmental process. Properly conceived, they both result in sequences 
which are challenging, purposeful, and meaningful to the learner. The 
"child centered" approach emphasizes the maturation process with a tend- 
ency to ignore goals, while the "culture centered" approach emphasizes the 
selection, refining, and grading of subject matter in the direction of definite 
goals. The first tends in practice to receive the greatest emphasis during 
the period of rapid maturation (primary and elementary), the second, as 
maturity is approached (high school and college). 

That the development of motor, social, and intellectual abilities is 
highly sequential in nature is attested by the highly reliable age scales which 
have been constructed. The facts previously presented regarding the nature 
and extent of individual and trait differences are based on such scales. In 
general they indicate that individuals differ greatly in the rate of develop- 
ment of a given trait and in the level of development attained at maturity, 
also that the various traits of an individual develop at different rates and 
reach different levels at maturity. 

Because of the waste and discouragement involved in attempting to teach 
again what the learner already knows or attempting to teach him at a level 
far beyond his present attainment, it is important that procedures be insti- 
tuted for informing both the teacher and the learner of his status in a given 
sequence and what the nature of the next educational experiences should 
be if optimum development is to be achieved. This is true whether the 
curriculum is organized around itemized, piecemeal, socially validated goals 
in which sequence has been experimentally established, or around experi- 
ence units in which functional relationships, meaning, and pupil-initiated 
activities determine sequence to a large degree. The variability of develop- 
ment in instructional groups is equally large under either system of cur- 


riculum organization. Actually, both approaches have a definite place in 
curriculum development as long as the goals are seen in relation to the 
potentialities of the individual learner, to his probable future vocational 
status and needs, and when the teacher has a point of view and procedure 
which welcome and reward the IQ of 75 as well as the IQ of 125. 

Whether the process of determining pupil status in a given develop- 
mental area and adjusting instruction to status should be called remedial 
teaching is questionable. It would seem to be just good teaching. It has been 
called remedial teaching because of erroneous ideas of the nature and extent 
of individual and trait differences and faulty conceptions of the schooling 
process based on them. The common criteria of need for remedial instruc- 
tion have been: (1) discrepancy between measured intelligence and achieve- 
ment in a given area, and (2) achievement status below grade status in a 
given area. The first criterion is based on the assumptions that individuals 
have equal aptitude in all areas and that this aptitude is measured by an 
intelligence test; the second assumes that all children should achieve up to 
the norms established for their age_oj grade group. Both of these assump- 
tions we have found to be false. |TJi£ best measure of what an individual* 
should achieve in a given area is past achievement in that areajThe need 
for remedial attention is indicated when progress in an area is stopped or 
markedly slowed down over a period of time. 
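The criterion just stated — that remedial attention is indicated when progress in an area stops or markedly slows, judged against the pupil's own past achievement — can be sketched concretely. This is an illustrative sketch only (Python used purely for illustration; the ratio threshold and the scores are invented, not from the text):

```python
# Illustrative sketch: flagging need for remedial attention when a pupil's
# progress in an area stalls relative to his own earlier rate of gain.
# The `ratio` threshold is an invented parameter, not from the text.
def progress_stalled(scores, ratio=0.5):
    """scores: a pupil's successive scores in one area, oldest first.
    True when the latest gain falls below `ratio` of his average earlier gain."""
    if len(scores) < 3:
        return False  # too little history to judge
    gains = [b - a for a, b in zip(scores, scores[1:])]
    average_earlier_gain = sum(gains[:-1]) / len(gains[:-1])
    return average_earlier_gain > 0 and gains[-1] < ratio * average_earlier_gain

# Grade-equivalent reading scores over four testings (invented):
print(progress_stalled([3.1, 3.6, 4.1, 4.2]))  # gain slowed markedly -> True
print(progress_stalled([3.1, 3.6, 4.1, 4.6]))  # steady progress -> False
```

Note that the comparison is entirely within the pupil's own record; no appeal is made to grade norms or to an intelligence-test score.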

The determination of pupil status in a given area of achievement and 
the adjustment of instruction to status should be a continuing process in 
every classroom. The role of testing in the process is an important one. 
Both achievement tests and diagnostic tests have important functions to 
perform. A general achievement test is one designed to express in terms 
of a single score a pupil's achievement in a given field of achievement. 
A diagnostic test is one intended to discover specific deficiencies in learning 
or teaching. It is a test in which a single total or composite score is of little 
or no significance, but on which the part scores or the percentages of cor- 
rect responses to individual items are the measures sought. Tests may be 
diagnostic in various degrees. A test in English correctness, for example, 
may break the whole field up into such divisions as spelling, capitalization, 
punctuation, grammar, and usage, yielding a part score for each division, 
or may still further analyze each of these divisions, splitting the section on 
punctuation into tests on the use of the comma, period, semicolon, and so 
forth. Or it may make an even more detailed analysis, considering sep- 
arately, for example, each of the types of situations in which, for instance, 
the comma may be used. To the degree, then, that the emphasis in the test 
is placed upon the part scores or upon percentages of responses to individual 
items, that test is of the diagnostic type. To the degree that the emphasis 


is placed upon a single total score, designed to yield a measure of general 
achievement, the test is of the general achievement type. Perhaps the ma- 
jority of the tests constructed for informal use by the classroom teacher 
are or should be of the diagnostic type. 

The more general achievement test batteries which yield scores in vocab- 
ulary, reading comprehension, arithmetic reasoning, arithmetic computation, 
etc., have limited value in the planning of instruction for specific pupils. 
The major functions of such comprehensive batteries may be summarized 
briefly as follows: 

1. To direct curriculum emphasis by: 

a) Focusing attention on as many of the important ultimate objec- 
tives of education as possible. 

b) Clarifying educational objectives to teachers and pupils. 

c) Determining elements of strength and weakness in the instruc- 
tional program of the school. 

d) Discovering inadequacies in curriculum content and organization. 

2. To provide for educational guidance of pupils by: 

a) Providing a basis for predicting individual pupil achievement in 
each learning area. 

b) Serving as a basis for the preliminary grouping of pupils in each 
learning area. 

c) Discovering special aptitudes and disabilities. 

d) Determining the difficulty of material a pupil can read with 
understanding. 

e) Determining the level of problem-solving ability in various areas. 

3. To stimulate the learning activities of pupils by: 

a) Enabling pupils to think of their achievements in objective terms. 

b) Giving pupils satisfaction for the progress they make, rather than 
for the relative level of achievement they attain. 

c) Enabling pupils to compete with their past performance record. 

d) Measuring achievement objectively in terms of accepted educa- 
tional standards, rather than by the subjective appraisal of teachers. 

4. To direct and motivate administrative and supervisory efforts by: 

a) Enabling teachers to discover the areas in which they need super- 
visory aid. 

b) Affording the administrative and supervisory staff an over-all 
measure of the effectiveness of the school organization and of the 
prevailing administrative and supervisory policies. 
However, such achievement test batteries are too general to be used as 


a basis for instruction even when detailed analysis of items is made, although 
such an analysis has value and is certainly not to be discouraged. 
The sampling of items is too limited and the organization too gross for 
such tests to be considered as adequate guides in the planning and direct- 
ing of educational experiences for individual pupils. 

Two approaches have been made to the problem of determining status 
in a learning area as a basis for further instruction. One is commonly 
called "readiness testing" and is represented by reading readiness tests and 
arithmetic readiness tests at the elementary school level and by aptitude or 
prognostic tests in algebra, geometry, and language at the secondary school 
and college levels. Such tests are based on an analysis of the learnings 
essential to satisfactory progress in a predetermined instructional sequence. 
Too frequently the emphasis is placed on discovering pupils who cannot 
profit from a given course of instruction rather than on determining the 
optimum course for each pupil. That is, readiness is considered as some- 
thing to wait for, rather than something that should be developed. Instead 
of adjusting the curriculum to the individual, individuals are sorted in 
terms of an inflexible curriculum. 

The other approach to determining status in a learning sequence is 
labeled "diagnostic testing." Such tests are administered after a period 
of instruction to determine points of faulty or inadequate learning in a 
detailed and analytical manner with a view to correction. Superior teachers 
constantly carry on the process of checking learning through direct ob- 
servation of behavior and informal testing. The values of expertly pre- 
pared tests for this purpose are that: (1) They are more thoroughly 
analytical than those most teachers are able to prepare. (2) They make the 
teachers aware of the important elements, necessary sequences, and diffi- 
culties of the process. (3) They save the teacher's time and energy in 
diagnosis and leave more for individual remedial work. (4) They help the 
pupil recognize his learning needs by systematically emphasizing his errors. 
(5) Remedial procedures are usually suggested or provided which save the 
teacher's time and also aid in systematizing the process. 

In order that diagnostic tests may be most effective, they should meet 
the following general criteria: (1) They must be an integral part of the 
curriculum, emphasizing and clarifying the important objectives. (2) The 
test items should require responses to be made to situations approximating 
as closely as possible the functional. (3) The tests must be analytical and 
based on experimental evidence of learning difficulties and misunderstand- 
ings. (4) The tests should reveal the mental processes of the learner 
sufficiently to detect points of error. (5) The tests should suggest or pro- 


vide specific remedial procedures for each error detected. (6) The tests 
should be designed to cover a long sequence of learning systematically. (7) 
The tests should be designed to check forgetting by constant review of 
difficult elements, as well as to detect faulty learning. (8) Pupil progress 
should be revealed in objective terms. Diagnostic tests frequently deal 
almost exclusively with the more mechanical aspects of a learning sequence 
and neglect the higher abilities requiring relational thinking and problem- 
solving. This is defensible in that efficiency in learning the more mechani- 
cal aspects leaves more time for the development of the higher mental processes. 

The considerable proportion of elementary school time devoted to 
diagnostic testing and remedial teaching in the modern school is not gen- 
erally recognized. Almost all the time devoted to the formal teaching of 
spelling and handwriting is consumed in the search for, and correction of, 
specific errors. In arithmetic, reading, and the mechanics of English many 
modern schools devote no less than one-fifth of the time allotted for the 
development of skills to finding the specific deficiencies of individual pupils 
and correction procedures. Many such tests are pupil-scored, and self- 
diagnosis is stimulated. 

Perhaps the term "diagnostic testing" is more justified in its use in 
connection with the treatment of the more severe cases of maladjustment 
and arrested development which require insight and treatment of a more 
complex nature. Here the learning difficulties are more general, more com- 
plex, and more difficult to locate. The object of concern is not so much 
with what to learn as with factors limiting learning in general or in a 
specific area. Understanding is sought by studying the physical, intellectual, 
emotional, educational, and environmental factors to discover reasons for 
the maladjustment. 

Functions of Measurement in the Motivation of Learning 

A motivating condition has three functions in the learning process: (1) 
the energizing function, to increase the general level of activity and effort; 
(2) the directive function, to direct the variable and persistent activity of 
the organism into desirable channels; and (3) the selective function, to 
determine the responses which will be fixated and the responses that will 
be eliminated. Testing procedure properly conceived and executed places 
the control of the learning process within the educator's power as no other 
teaching device does. The three functions of a motivating condition are 
inherent in the test situation and are important criteria in the evaluation 
of measurement procedures. 


Energizing function. — The extent to which examinations increase the 
general level of learning activity and effort is attested by the cramming 
sessions in high school and college which precede examination periods. 
In many schools examinations are the payoff. Success or failure depends 
upon them. The examinations determine to a large extent when students 
study, what they study, and how they study. Each instructor is scrutinized 
throughout his course to determine what he will emphasize and the type of 
questions he will ask in the final. Unless the examinations truly measure 
the real objectives of the course, the value of such motivation may be 
questioned. The use of such tests as a basis for relative grading in the 
schools attended by all children may be criticized in the light of what 
has been said previously in this chapter. 

It has been reasoned that if tests increase effort, the more frequent the 
testing the greater the total effort. It is probable that the optimum fre- 
quency of testing depends upon the nature of the tests, the method of 
teaching, whether pupils score their own or each other's tests, the subject 
matter, and the ability of the students. When given too often, the relative 
importance of each test is reduced. Daily examinations probably have but 
little more energizing effect than daily assignments. It has been found 
(18, 32, 57, 17) in a variety of fields at the college level that when tests 
are given weekly and the results discussed, individual errors noted, and 
the final examination made up of similar questions, the lower-ability stu- 
dents achieve more than with less-frequent examinations. However, the 
more able students may be retarded by this process unless there are in each 
test, items of sufficient difficulty to challenge their ability. If, as is fre- 
quently true, the examinations cover only minimum essentials, it is pos- 
sible they would profit more from the additional material which could 
be covered if less time were devoted to testing and remedial work. It is 
probable that the advantages of frequent testing even for the less able are 
the result of directing their learning and selecting the right responses 
rather than of an increase in energy stimulated by the examinations. 

The extent to which knowledge of success on examinations motivates 
learning has been carefully investigated. In one of the earliest and most 
carefully controlled experiments Panlasiqui (29) administered twenty 
weekly fifteen-minute tests in mixed fundamentals of arithmetic to 358 
matched pairs of fourth-grade pupils. In the experimental classes the 
amount of progress from week to week was stressed. Progress charts for 
both the individual and the class were kept, studied, and discussed. In the 
control classes no emphasis was placed on how much pupils scored or 
progressed. The difference between the achievement of the two groups 


over the twenty-week period was statistically significant. Knowledge of 
progress was beneficial in relation to the amount of progress. The more 
able pupils profited most. Those who made little progress, of course, were 
little stimulated by it. 

The influence of general praise and general reproof following an exami- 
nation has been investigated by Hurlock (16). The results indicate that, in 
general, praise is more stimulating than reproof. However, praise is most 
effective with the immature and less able student, while reproof is generally 
most effective with the mature and competent student. Either praise or 
reproof is more stimulating than no comment. 

In general, the energizing effect of examinations depends largely upon 
the degree to which students are successful in them. Those who make high 
marks, who progress and receive praise are stimulated. Those who make 
low marks, who do not progress and are reproved become discouraged. The 
necessity of establishing reasonable educational goals, with at least partial 
success within the reach of all, seems indicated. 

Directive function. — It is difficult to overemphasize the importance of 
examination procedure in determining what teachers teach and how they 
teach — what pupils learn and how they learn. The facts are well established. 

In regional, state-wide, and local school testing programs where schools, 
teachers, and pupils are rated to some extent by the test results, the nature 
of the tests largely determines the quality of the schooling process. When 
tests are imposed upon the teacher and pupils, with important quality 
judgments involved, they become powerful instruments for determining 
educational goals and methods. The effects may be good or bad, depending 
upon the nature of the tests. 

If the tests are based on traditional curriculum materials of the factual 
type, designed to determine the amount of textbook material which has 
been memorized, they have the effects ascribed to them by Douglass (11) 
of: (1) artificially determining objectives and method and of "freezing" 
the curriculum, (2) encouraging memorizing, cramming, regimentation, 
and mechanization, (3) reducing the teacher to the status of a tutor, (4) 
emphasizing only those outcomes that can be measured by objective tests, 
and (5) standardizing the curriculum, preventing adaptation to local needs 
and the further evolution of educational procedure. However, if the tests 
measure important study skills and problem-solving abilities, emphasize 
generalizations and their application to new situations, clarify important 
and neglected objectives, and focus attention on the ultimate objectives of 
education — the permanent learnings — then they may have the general effect 
of "thawing out" the curriculum of the schools, and stimulating more 


acceptable teaching methods, objectives, materials, and learning procedures. 

The influence of examinations on study procedures and objectives has 
been well established (35, 12, 36, 26). In preparing for essay-type exami- 
nations, students tend to outline and organize material systematically in 
large units, emphasizing relationships, trends, and personal reactions. In 
preparing for the broad sampling type of factual objective examinations, 
students emphasize factual details, names, dates, and results of specific 
experiments. A difference was found in the way students prepare for dif- 
ferent types of objective tests. When preparing for completion tests, for 
example, students attempt a word-for-word mastery of important state- 
ments. In preparing for true-false tests, definitions and detailed facts tend 
to be emphasized. Differences in what is learned, the amount retained over 
a period of time, and the ability to take different types of tests have been 
experimentally related to the type of test a student prepares himself to take.
Certainly one of the most important criteria of the value of an achieve- 
ment test is the degree to which it directs teaching and learning procedures 
into desirable channels resulting in the achievement of the most acceptable objectives.
Since comprehensive and systematic testing programs direct teaching 
procedures and learning toward the objectives measured, there is danger 
of warping the curriculum in the direction of measured objectives. The 
necessity of measuring all important objectives is indicated. 

Selective function. — The extent to which tests help fixate correct and
desirable behavior and eliminate errors depends not only upon the nature 
of the instrument but also upon how it is scored, and upon the emphasis 
placed on individual errors and remedial work in the follow-up procedure. 

The selective function of measurement is related to the diagnostic func- 
tion dealt with in a previous section of this chapter. In diagnostic testing 
emphasis is placed on diagnosis by the teacher or the educational psycholo- 
gist. Here the concern is more with the student and teacher obtaining direct 
knowledge of errors (and, if possible, the reason for these errors) from 
the test situation. 

Maximum learning results from testing when students are permitted 
to score their own papers and discussion of errors and remedial work 
follow immediately. Little (23) reports a well-controlled investigation in-
volving the selective function of tests utilizing fourteen sections of a course 
in educational psychology averaging thirty students each. Four experimental 
sections took twelve unit tests during the quarter. The tests were im- 
mediately scored by machine, returned to the student, and the errors discussed
in class. Four other experimental sections used a drill-machine
placed on the desk of each student. The same twelve tests were taken on 
the machine. Each student indicated his answer to each question by press- 
ing one of five keys. If the answer was correct the drum of the machine 
turned up the next question; if incorrect the same question remained be- 
fore the student until the correct key was punched. The score was the total 
number of tries made in completing the test. The six control sections took 
only the pretest, mid-term, and final, which were administered to all sec- 
tions. Both experimental groups showed superiority over the control 
sections on the final examination, with differences significant at the 1 per-
cent level. The superiority of the drill-machine group was approximately 
twice that of the machine-scored group. 
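The drill-machine procedure Little describes amounts to a simple scoring loop: each question stays before the student until the correct one of the five keys is pressed, and the score is the total number of tries made in completing the test. A minimal simulation of that rule is sketched below; the machine behavior and the scoring are as described in the text, but the function and variable names are our own invention.

```python
import random

def drill_machine_score(answer_key, respond):
    """Simulate the drill-machine: each question remains on the drum
    until the correct one of five keys is pressed; the score is the
    total number of tries over the whole test."""
    total_tries = 0
    for question, correct_key in enumerate(answer_key):
        while True:
            total_tries += 1
            if respond(question) == correct_key:
                break  # drum turns up the next question
    return total_tries

# A hypothetical twelve-item test; a student who presses the correct
# key on the first try for every item scores 12 (one try per question).
key = [random.randrange(5) for _ in range(12)]
perfect = drill_machine_score(key, lambda q: key[q])
print(perfect)  # 12
```

Under this rule a lower score indicates better performance, which is consistent with the superiority the drill-machine sections showed on the final examination.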

Logic and experimental evidence indicate that in the test situation the 
more immediate and direct the student's knowledge of when and why he 
is correct and when and why he is incorrect, the greater the tendency to 
fixate the correct responses. 

Functions of Measurement in the Development and Maintenance of Skills and Abilities

When test items approximate the situations in life in which the learning 
will function, each item is not only an excellent test item, but a good 
teaching question as well. Such an examination becomes an effective learn-
ing device because it requires thought and problem-solving effort, and not 
mere recall or recognition of previous learning. Because of the additional 
motivation resulting from the test situation, it is probable that more learn- 
ing takes place during the examination period than during any other equal 
period of learning time. 

Recognizing the potentialities of tests for stimulating learning, especially 
in the skill and problem-solving areas (notably reading, mathematics, 
science, and the mechanics of English), series of drill-material tests have
been constructed as aids in developing and maintaining skills. Although 
the measurement function of these drill-tests is important from the stand- 
point of measuring progress and motivation, this function is really sub- 
sidiary to that of developing and maintaining skills and problem-solving abilities.
The general characteristics of such drill-tests are: (1) the maintenance 
program is integrated with the development program; (2) the temporal 
distribution of drill is controlled to give a maximum of maintenance with 
a minimum of time; (3) each drill is graduated in difficulty; (4) progress 
records and/or charts are provided; (5) drill is provided on all aspects
of the skill; (6) specific difficulties recur at regular intervals; (7) diag-
nosis of error and remedial work are provided for; (8) several skills or 
abilities may be required by each test, but they are limited in scope and 
specific in nature. 
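Several of these characteristics, notably graduated difficulty (3), progress records (4), and the regular recurrence of specific difficulties (6), can be illustrated in a small sketch. The class below is hypothetical, written for illustration rather than taken from any published drill series:

```python
from collections import deque

class DrillSeries:
    """A minimal drill-test scheduler: items are ordered from easy to
    hard, missed items recur later in the series, and a progress
    record is kept for each drill session."""
    def __init__(self, items):
        # items: list of (prompt, answer) pairs, graduated in difficulty
        self.queue = deque(items)
        self.progress = []  # one (right, wrong) pair per session

    def run_session(self, respond, size=5):
        right = wrong = 0
        for _ in range(min(size, len(self.queue))):
            prompt, answer = self.queue.popleft()
            if respond(prompt) == answer:
                right += 1
            else:
                wrong += 1
                # a missed item recurs at the end of the series
                self.queue.append((prompt, answer))
        self.progress.append((right, wrong))
        return right, wrong
```

Here a missed item simply returns to the end of the queue; a real drill series would also control the temporal spacing of its recurrence, in keeping with characteristic (2) above.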

Such materials have been available in the better-planned workbooks and 
in the form of self-testing drills for over twenty years. When teachers 
have insight into the nature of their construction and of their proper func- 
tion in the learning process, they have been accepted as valuable teaching 
aids. To be valuable, the design of such material must be based on experi- 
mental evidence and adjusted to the ability of the student (several levels 
of difficulty should be used at a given grade level), and progress should 
be recognized in units more sensitive than age or grade scores. The danger 
is in overusing such materials to the extent that only the mechanics of a 
process are emphasized. When time is saved by these drill-tests in the more 
mechanical aspects of a skill area, it should mean that more time can be 
spent on reasoning, insights, understanding, applications, and the higher mental processes.
Educational Measurement and the Improvement of 
Educational Practice 

The variability in aptitudes of those we teach and the educational pro- 
cedures suggested to secure a maximum of learning and development have 
been the burden of this chapter. To those who know well the schools, the 
teachers, and school budgets, the proposals may seem visionary and im- 
practical. To those who know schools largely in the abstract, the recom- 
mendations may seem commonplace. 

Much could be said and more should be known concerning variability in 
the competencies of teachers. One million of the sixty million employed
people in the United States are teachers. That this group is
highly selected with reference to the qualities required for teaching is im- 
probable. There is evidence that those enrolled in many teachers colleges 
represent the "mine run" of high school graduates in academic competence. 
Some colleges with students below this level have been found. It is probable 
that in the majority of classrooms above the primary level high-ability 
pupils will be found who exceed their teachers in general intelligence, 
reading comprehension, and problem-solving ability. In teaching, as in the 
other professions, there will never be enough high-level talent available 
to fill the positions. 

The low relationship commonly found between the competence of a 
teacher in a given field of learning and the achievement of his students
in that field has served to salve our concern. But a more adequate analysis
of such data indicates that, in general, the high-level students achieve the 
most with a high-level teacher and that low-level students achieve the most 
with a low-level teacher (33). This suggests that through measurement 
we may ultimately be able to place the student with the right teacher as 
well as give him the right book and set of problems. 

The influence of the social climate of the classroom on academic achieve- 
ment is still an area for investigation, but its influence on social behavior 
both inside and outside the classroom has been documented (21). It has 
also been shown that the social climate which a teacher habitually main- 
tains in his classroom is highly related to the teacher's attitudes toward 
students, the school, fellow-teachers, and teaching procedures (6, 20).
These attitudes have been successfully measured. By knowing a teacher's 
score on an attitudes inventory, it is possible to predict the social climate 
he will maintain in the classroom. The resistance of these attitudes to edu- 
cation and teaching experience indicates sufficient stability for purposes 
of teacher selection. 

Although progress in the more appropriate selection, education, and 
placement of teachers may be anticipated, it will still be true that a high 
proportion of teachers will lack adequate insight into the educative process. 
The hope here must lie in improved materials of instruction. The ad- 
ministration and scoring of educational tests have been simplified to the 
level of a clerical task, but the interpretation and appropriate procedures 
that should follow testing require high-level ability and frequently extraordinary
effort and determination. In most schools the needed instructional
materials are not available. The teacher with the know-how is frequently 
forced to construct materials or to improvise with what can be found. For 
example, it is relatively easy to determine the reading comprehension levels 
of pupils in a sixth-year class, but to find materials of appropriate diffi- 
culty on the proper subjects with the proper interest appeal is extremely 
difficult and in most schools impossible. The schools have a long way to 
go in providing materials and equipment necessary for meeting the needs 
revealed by educational tests. 

In summary, it may be said that educational measurement has taken a
leading role in analyzing and refining educational objectives with reference 
both to their ultimate importance and to the most efficient and appropriate 
learning experiences. It has furnished the means of clarifying these objec- 
tives to both the teacher and the learner. It affords the means of determin- 
ing the status of the learner in each learning area, suggesting appropriate 
experiences. It has emphasized the importance of the how in learning as
well as the what, placing study habits and intellectual skills in proper
perspective. It has focused attention on child development and intellectual 
growth, diverting it from the subject-matter-to-be-covered point of view. 
It not only motivates the learner but also directs his efforts into effective
channels. It furnishes the school official with a more adequate conception of 
his responsibility and serves as a guide to a more effective school organiza- 
tion and the selection of educational materials and facilities. Truly, educa- 
tional measurement has outrun educational practice, but its leadership is
wholesome and effective.

Selected References 

1. Anastasi, A. "Practice and Variability: A Study in Psychological Method," Psychological Monographs, 45: 1-55, 1934.

2. Anastasi, A., and Foley, John P., Jr. Differential Psychology. New York: Macmillan 
Co., 1949. Chaps. 14, 15. 

3. Buckingham, B. R. "The Greatest Waste in Education," School and Society, 24: 
653-58, 1926. 

4. Burr, Marvin Y. A Study of Homogeneous Grouping. ("Contributions to Educa- 
tion," No. 457.) New York: Teachers College, Columbia University, 1931. 

5. Cook, Walter W. Grouping and Promotion in the Elementary School. ("Series on 
Individualization of Instruction," No. 2.) Minneapolis: University of Minnesota Press, 

6. ———. "Measuring the Teaching Personality," Educational and Psychological Measurement, 7: 399-410, Autumn 1947.

7. ———. "Some Effects of the Maintenance of High Standards of Promotion," Elementary School Journal, 41: 430-37, Feb. 1941.

8. Cornell, Ethel L. The Variability of Children of Different Ages and Its Relation 
to School Classification and Grouping. ("University of the State of New York 
Bulletin," No. 1101; "Educational Research Studies," 1937, No. 1.) 98 pp. 

9. Coxe, W. W. Levels and Ranges of Ability in New York State High Schools. ("Uni-
versity of the State of New York Bulletin," No. 1001; "Educational Research Studies," 

10. ———. Study of Pupil Classification in the Villages of New York State. ("Uni-
versity of the State of New York Bulletin," No. 841; "Educational Research Studies," 

11. Douglass, H. R. "The Effects of State and National Testing on the Secondary School," 
School Review, 42: 497-509, 1934.

12. Douglass, H. R., and Tallmadge, Margaret. "How University Students Prepare for 
New Types of Examinations," School and Society, 39: 318-20, March 10, 1934. 

13. Hawkes, H. E.; Lindquist, E. F.; and Mann, C. R. The Construction and Use of 
Achievement Examinations. Boston: Houghton Mifflin Co., 1936. Pp. 23-26.

14. Hollingshead, A. D. An Evaluation of the Use of Certain Educational and Mental
Measurements for the Purpose of Classification. ("Contributions to Education," No. 
302.) New York: Teachers College, Columbia University, 1928. 

15. Hull, Clark L. "Variability in Amount of Different Traits Possessed by the Indi- 
vidual," Journal of Educational Psychology, 18: 97-104, 1927. 

16. Hurlock, Elizabeth B. "An Evaluation of Certain Incentives Used in School Work," 
Journal of Educational Psychology, 15: 145-59, March 1925. 

17. Keys, Noel. "The Influence on Learning and Retention of Weekly Tests as Opposed 
to Monthly Tests," Journal of Educational Psychology, 25: 427-36, Sept. 1934. 

18. Kirkpatrick, James Earl. "The Motivating Effect of a Specific Type of Testing 
Program," University of loiva Studies in Education, 9: 41-68, June 15, 1934. 


19. Learned, William S., and Wood, Ben D. The Student and His Knowledge.
("The Carnegie Foundation for the Advancement of Teaching Bulletin," No. 29.) 
New York: The Foundation, 1938. 

20. Leeds, Carroll H., and Cook, Walter W. "The Construction and Differential Value 
of a Scale for Determining Teacher-Pupil Attitudes," Journal of Experimental Edu- 
cation, 16: 149-59, Dec. 1947. 

21. Lewin, K.; Lippitt, R.; and White, R. K. "Patterns of Aggressive Behavior in
Experimentally Created 'Social Climates,'" Journal of Social Psychology, 10: 271-99,

22. Lindquist, E. F., et al. Manual for Administration and Interpretation of 1938 Iowa
Every-Pupil Tests of Basic Skills. Iowa City: Bureau of Educational Research and
Service, State University of Iowa, 1938. P. 23.

23. Little, J. Kenneth. "Result of Use of Machines for Testing and for Drill, Upon 
Learning in Educational Psychology," Journal of Experimental Education, 3: 45-49, 

24. McCall, William A. Measurement. New York: Macmillan Co., 1939. Chap. 11. 

25. McNemar, Quinn. The Revision of the Stanford-Binet Scale. Boston: Houghton 
Mifflin Co., 1942.

26. Meyer, George. "An Experimental Study of the Old and New Types of Examina- 
tion," Journal of Educational Psychology, 25: 641-61, Dec. 1934; and 26: 30-40,
Jan. 1935. 

27. National Society for the Study of Education. Thirty-fourth Yearbook: Edu- 
cational Diagnosis. Chicago: University of Chicago Press, 1935. 563 pp. 

28. ———. Thirty-fifth Yearbook, Part I: The Grouping of Pupils. Chicago: University of Chicago Press, 1936. 319 pp.

29. Panlasigui, Isidoro. "The Effect of Awareness of Success on Skill in Arithmetic."
Unpublished Doctor's dissertation, State University of Iowa, 1928.

30. Peterson, J., and Barlow, M. C. "The Effects of Practice on Individual Differences,"
Twenty-seventh Yearbook of the National Society for the Study of Education, Part II: 
Nature and Nurture: Their Influence upon Achievement. Bloomington, Ill.: Public School
Publishing Co., 1928. Pp. 211-30 (chap. 14). 

31. Reed, H. B. "The Influence of Training on Changes in Variability in Achievement," 
Psychological Monographs, 41: 1-59, 1931. 

32. Ross, C. C, and Henry, Lyle K. "The Relation between Frequency of Testing 
and Progress in Learning Psychology," Journal of Educational Psychology, 30: 604-11, 
Nov. 1939. 

33. Steele, Lysle Hugh. "A Study of the Relationship between Teacher Proficiency in 
the Basic Skills and Pupil Achievement in the Same Skills." Unpublished Master's 
thesis. University of Minnesota, Aug. 1947. 

34. Terman, Lewis M., and Merrill, Maud A. Measuring Intelligence. Boston: Hough- 
ton Mifflin Co., 1937. 

35. Terry, Paul W. "How Students Review for Objective and Essay Tests," Elementary 
School Journal, 33: 592-603, April 1933. 

36. ———. "How Students Study for Three Types of Objective Tests," Journal of Edu-
cational Research, 27: 333-43, Jan. 1934. 

37. Turney, Austin H. "The Effect of Frequent Short Objective Tests upon the 
Achievement of College Students in Educational Psychology," School and Society, 
33: 760-62, June 6, 1931. 

38. Tyler, R. W. "Permanency of Learning," Journal of Higher Education, 4: 203-4, 

39. Wert, J. E. "Twin Examination Assumptions," Journal of Higher Education, 8: 
136-40, 1937. 

40. Wood, Ben D. "The Need for Comparable Measures in Individualizing Education." 
Educational Record, 20: 14-31, Jan. 1939.

2. The Functions of Measurement in Improving Instruction

By Ralph W. Tyler 

University of Chicago 

Collaborators: W. W. Charters, formerly of Ohio State University; 
Henry S. Dyer, Harvard University; Ruth E. Eckert, University of Minne-
sota; J. W. Wrightstone, New York City Schools

Since the purpose of this chapter is to outline the ways in which 
educational measurement, that is, achievement testing, can serve to improve 
instruction, we shall consider first what steps are involved in an effective 
program of instruction and then indicate the contributions that achievement
testing can make to each of these steps. In this connection it will be
noted that educational measurement is conceived, not as a process quite 
apart from instruction, but rather as an integral part of it. After discussing 
the contributions that achievement testing can make in the planning and 
conduct of an instructional program, the chapter then treats the ways in 
which educational measurement can contribute to effective supervision of 
instruction and to administration. 

What Is Instruction? 

The relationship between achievement testing and instruction can be 
seen more clearly by noting the nature of the instructional process. Basically, 
instruction is the process by which desirable changes are made in the 
behavior of students, using "behavior" in the broad sense to include think- 
ing, feeling, and acting. Instruction is not effective, therefore, unless
some changes in the behavior of students have actually taken place. Thus,
in a certain English course one purpose of instruction may be to develop 
increased skill in writing clearly and in well-organized fashion; another 
purpose may be to develop increased ability to read and to interpret novels. 
English instruction in such courses will not have been effective unless 
these kinds of behavior changes have actually taken place in the pupils — 
unless the students do become more skillful in writing and are better able 
to read and interpret novels. To take a second illustration, instruction in a
certain science course may be aimed at developing an understanding of
certain important science principles, an ability to utilize these principles 
in explaining common scientific phenomena, and some skill in analyzing 
scientific questions to see the kinds of data needed to solve them. In- 
struction in this science course will not have been effective unless changes 
of these kinds actually take place in the students. 

Unless instruction is to be merely a haphazard or intuitively guided
process, it requires rational planning and execution in terms of the plans. 
Viewed in this way, instruction involves several steps. The first of these 
is to decide what ends to seek, that is, what objectives to aim at or, stated 
more precisely, what changes in students' behavior to try to bring about. 
The second step is to determine what content and learning experiences can
be used that are likely to attain these ends, these changes in student be- 
havior. The third step is to determine an effective organization of these 
learning experiences so that their cumulative effect will be such as to 
bring about the desired behavior changes in an efficient fashion. Finally, 
the fourth step is to appraise the effects of the learning experiences to find 
out in what ways they have been effective and in what respects they have 
not produced the results desired. Obviously, this fourth step is educa-
tional measurement, or achievement testing. It is an essential part of instruc-
tion because without appraisal of the results being attained, the instructor 
has no adequate way of checking the validity of his judgments regarding
the values of particular learning experiences and the effectiveness of their
organization in attaining the ends of education. 

In appraising the effects of the learning experiences today we not only 
test but also evaluate. "Evaluation" designates a process of appraisal which 
involves the acceptance of specific values and the use of a variety of in- 
struments of observation, including measurement, as the bases for value- 
judgments. From the point of view of its functions it involves the identifi- 
cation and formulation of a comprehensive range of major objectives of 
a curriculum, their definition in terms of pupil behavior, and the construc- 
tion of valid, reliable, and practical instruments for observing the specific 
phases of pupil behavior such as knowledges, information, skills, attitudes, 
appreciations, personal-social adaptability, interests, and work habits. Any 
learning situation has multiple outcomes. While the child is acquiring 
information, knowledges, and skills, there are also taking place concomi- 
tant learnings in attitudes, appreciations, and interests. This view indicates 
a shift from a narrow conception of subject-matter outcomes to a broader 
conception of growth and development of individuals. 


Measurement Helps in Selecting Objectives 

Educational measurement is not only the fourth phase of the instructional 
process, but it also contributes to the first phase, that is, the selection of 
the objectives to be sought. This contribution can be seen by examining the 
step of selecting educational objectives. Since it is possible to bring about 
a great many changes in behavior through the process of instruction, since 
some of these changes would be quite undesirable (for example, teaching 
children to steal) , and since the time which the school can devote to instruc- 
tion is relatively small and does not permit the attainment of all the possible 
desirable objectives, it would seem evident to an intelligent observer that 
every school and every instructor needs to determine what objectives shall 
be aimed at and to select a small enough number so that they can be at- 
tained with some degree of success. Although this argument for a careful 
selection of objectives seems obvious, as a matter of fact, a great many 
schools and teachers carry on the process of instruction without having a 
clear conception of the ends to be reached. It is true in schools and colleges 
as in other social institutions that forms, materials, and procedures become 
traditional and are passed on from one generation to another without a 
clear realization of the ends they are expected to achieve. Hence, many 
of us find ourselves using particular books and carrying on particular kinds 
of instructional procedures, not so much because we have certain ends in 
mind, as because these materials and these procedures have been used for 
many years and, possibly, for many generations and are accepted as of 
value without any clear perception of their underlying purposes. 

Hence, people may carry on instruction without having clearly formulated 
the objectives. Yet, if instruction is to be rationally planned and choices are 
to be made among various possible materials and procedures of instruction, 
there must be a clear understanding of the ends to be sought. 

The selection and clarification of objectives can be facilitated by carry- 
ing on a program of educational measurement. It is not possible to con- 
struct a valid achievement test, or to use one properly, without clarifying 
the objectives which the test is supposed to measure. One cannot measure 
the outcomes of a course in English without knowing what particular 
changes in behavior are sought in the English course since the test is a 
device for determining whether or not these changes have actually occurred. 
If the English course seeks to develop skill in organizing written material, 
then it takes a different kind of test than does a course which is aiming at 
developing a knowledge of certain types of literary materials or certain 
skills in reading, or certain feeling responses to novels. The necessity of
having objectives clearly formulated, that is, stated in specific terms rather
than in terms of vague generalities, stimulates the instructional staff to
attack the problem of objectives and to carry the analyses to a degree of 
definiteness and clarity that would not be likely to occur if the testing 
problem were not uppermost in mind. Hence, educational measurement 
can help in selecting and clarifying educational objectives by stimulating 
the faculty to formulate their objectives and to express them clearly in 
terms of behavior. 

Selecting and defining objectives in operational terms does not in itself 
guarantee a wise choice of educational ends. Although the selection of 
goals on the part of school, college, or individual teacher is a matter of 
choice in the light of cherished values rather than a process of objective 
recognition, there are types of data that can be obtained by the school, 
college, or instructor that will provide bases for wiser decisions than when 
the choice of goals is made without such information. These include: (1) 
data regarding the students themselves, their present abilities, knowledge, 
skills, interests, attitudes, and needs; (2) data regarding the demands 
society is making upon the graduates, opportunities and defects of con- 
temporary society that have significance for education, and the like; (3) 
suggestions of specialists in various subject fields regarding the contribu- 
tions they think their subjects can make to the education of students. 

It should be clear that the ends to be aimed at in a particular school or 
a particular course should be ends not already attained by the student, but 
goals that can be built upon his previous background of skills, abilities, 
knowledge, attitudes, and interests, and objectives that to some degree help 
the student to deal with his own problems, to satisfy his interests, to meet 
his needs. It is also clear that the ends of education in a particular school 
or course should be to some degree those knowledges, skills, abilities, 
attitudes, and interests that have significance in contemporary society, and 
help to carry on an effective civilization and improve it. Hence, to provide 
suggestions about objectives like these, data regarding the students and 
about contemporary society are both helpful. Furthermore, specialists who 
have thought and worked intensively in a particular subject field sometimes 
get significant insights regarding the contribution their field can make 
to the education of children or youth. These suggestions are most likely 
to be found in reports of committees on the curriculum or teaching of the 
several subjects. These reports can be helpful in suggesting objectives. 

The three kinds of data listed will suggest more objectives than any 
one school or course can possibly attain. Furthermore, some of these objec- 
tives will be conflicting in their nature. It is necessary to make a choice 
among them. In making such a choice, the philosophy of education held 


by the school or the instructor serves as a guide to identify those objectives 
of greatest value in terms of the conception of the "good life" and the 
"good society" implied by this philosophy. The school will wish to em- 
phasize objectives that are most in harmony with its philosophy. 

Another consideration in choosing objectives is the findings of studies 
in the psychology of learning. Objectives that are not likely to be attainable 
according to the psychology of learning will be omitted as will objectives 
that are more appropriately developed at some other stage in the students' development.
The foregoing is a brief outline of the procedure involved in a rational 
process of selecting objectives. In this process, educational measurement 
serves particularly by providing data about the students that suggest edu- 
cational objectives. For example, in formulating objectives for a ninth- 
grade course in mathematics, all the students completing the eighth grade 
were tested in terms of their arithmetic skills, their ability to analyze and 
to solve a variety of commonplace problems involving quantitative aspects, 
and their understanding of certain basic mathematical concepts. The weak- 
nesses shown in problem-solving and in understanding mathematical con- 
cepts suggested that these were objectives to be emphasized in the ninth- 
grade course. 

To cite another example, the instructors in a freshman college course 
in music gave a battery of tests during the freshman orientation week to 
get suggestions regarding objectives for the music course. The tests in- 
volved ability to recognize similar musical themes, to identify types of 
musical instruments in various parts of orchestral records, to distinguish 
changes in pitch and in rhythm; they also involved knowledge of musical 
history and familiarity with various classical and popular works. Relatively 
low scores made by students on tests assessing recognition and understand- 
ing of musical recordings suggested the need for emphasizing as an objec- 
tive "the development of ability to hear and interpret music." 

These two examples are cited here to illustrate the use of educational 
measurement as a part of the process of rational selection of educational 
objectives. Educational measurement thus contributes in two ways to this 
first step in the process of instruction. It stimulates the instructional staff 
to select and to define its objectives clearly and it provides a tool of value 
in obtaining data to be used in making a wise selection of objectives. 

Measurement Helps in Selecting Content, Learning Experiences, 
and Procedures of Instruction 

The second major step in the process of instruction is to select content 
and learning experiences and to plan procedures of instruction that are
likely to attain the ends sought, that is, the objectives that have now been
selected and clearly defined. Educational measurement can contribute to 
this step in several ways. In the first place, the results of achievement tests 
given at the close of the preceding grade or course or at the beginning of 
the current one provide a basis for judging the various levels at which
students are ready to proceed toward the attainment of each major objective
and, therefore, the several levels of content and the kinds of instructional 
materials and procedures likely to be effective. For example, if one impor- 
tant objective of the English course is to develop ability to organize written 
materials effectively, a test of ability to organize given at the beginning of 
the course or at the close of the preceding course will indicate to what 
degree the various students have already attained some skill in organiza- 
tion and whether many of them are now ready to proceed with more com- 
plex organizational problems or should begin with much more elementary 
organizational problems in their writing. If another objective of the Eng- 
lish course is to develop the ability to interpret literary material with skill 
and understanding, tests given at the beginning will indicate what kinds 
of interpretations the students are now able to make and, therefore, what 
kinds of materials are appropriate for them in developing further skills 
and understanding in interpretation. Similar illustrations could be drawn 
from other fields. For example, in the case of a certain science course, if 
the objective is to develop some ability to apply important principles of 
science to common concrete phenomena, an examination given early in the 
course should indicate what principles the students already understand and 
to what kinds of concrete problems they can apply these principles. It is 
possible from the data to make inferences regarding the range of the 
appropriate content to be used with this group of students and the various 
instructional procedures that are likely to be effective with them. 

The second way in which the results of achievement tests can be used 
to guide the selection of content and procedures of instruction is through 
the evidence they furnish regarding the effectiveness of particular content 
and procedures that have been tried out experimentally. When these experi- 
ments are properly designed and controlled, the results provide some basis 
for judging the kind of content and the kind of instructional materials 
which are effective means for attaining the instructional goals. The results 
of the tests also serve as criteria for testing the hypotheses upon which 
the selection of content and instructional procedures was based. In some 
cases the results may indicate that the materials and procedures are highly 
effective. In other cases students may show little or no progress toward 
the educational goals. In most cases there will be some respects in which 
satisfactory results are being attained and other respects in which results 


are far from satisfactory. In the latter cases, such results of testing provide 
a basis for restudy, replanning, and modification in the content and in- 
structional procedures. Detailed analyses of the test results may help to 
indicate whether the failure to attain the results desired is due primarily 
to poor material or poor instructional procedures, or whether the very 
principles or hypotheses upon which the instructional plan was based may 
be invalid. This use of test results provides a rational basis for the con- 
tinued revision and improvement of the content and learning experiences. 

An example may illustrate this contribution of measurement more con- 
cretely. A certain course in science had as one of its aims the development 
by the student of the ability to use science principles in explaining some 
of the common natural phenomena. As the course was first conducted, an 
effort was made to include materials which described a variety of natural 
phenomena and to give the student a chance through field trips and labora- 
tory work to come in direct contact with many of these phenomena. It was 
found by testing as the course went on that the students gained a great 
deal of information about concrete phenomena, but were not able to ex- 
plain them in terms of the appropriate scientific principles. Since the re- 
sults were disappointing in every unit of the course, it seemed likely that 
the poor results could not be attributed merely to a particularly poor selec- 
tion of materials or a particularly poor field trip or experiment. Rather, 
the results raised a question as to the validity of the hypothesis upon which 
the course was planned, namely that familiarity with natural phenomena 
and descriptions of them would provide the means for students to learn 
to explain these phenomena in terms of scientific principles. 

In case a basic hypothesis upon which instruction is planned is brought 
into serious question, it is desirable to compare results secured with those 
obtained when a different hypothesis is used. In this example, the in- 
structors tried out a second hypothesis, namely, that formulating scientific 
principles inductively and practice in using these principles in explaining 
natural phenomena would provide the means for students to learn to 
explain other natural phenomena in terms of scientific principles. Using 
this second hypothesis, the instructors selected material which required 
certain observations and stimulated the students to develop scientific prin- 
ciples inductively. Opportunities were also provided for students to apply 
these principles to concrete phenomena and to criticize and revise the 
explanations which they developed. Further tests on the basis of the revised 
instructional program indicated that there was a considerable improvement 
in the degree to which students were able to utilize scientific principles in 
interpreting the concrete phenomena around them. This is one of many 
possible illustrations of the use of educational measurement in checking 


on content and learning experiences, and, when necessary, revising them. 
Achievement testing at the beginning and at the end of a period of in- 
struction thus contributes to the effective selection of instructional content 
and learning experiences. 

Measurement Helps in Organizing Learning Experiences 

In organizing learning experiences, the purpose is to put together the 
various learning experiences in such a way that the cumulative effect is 
made much greater than that which would result from a haphazard organi- 
zation of them. We may note that the organization of learning experiences 
involves relationships both vertically, that is, from one week or one month 
to the next within the same subject or field, and also horizontally, that is, 
from one subject to another and from the school sector of the child's 
experience to the out-of-school sector within the same period of time. 
Many learning experiences are relatively ineffective as isolated aspects of 
a student's life, but when they are appropriately organized so that one 
experience builds upon another and reinforces the other, profound changes 
in the student's behavior result. Since instruction is generally concerned 
with producing significant changes rather than merely transitory modifica- 
tions in behavior, it is very important to devise schemes of organization 
which will increase the efficiency of the learning experiences and result in 
the maximum cumulative effect. 

From this brief definition of the problem of organization, three criteria 
emerge for an effective organization of instructional experiences, namely, 
continuity, sequence, and integration. Continuity refers to the provision 
of continuing emphasis upon the desired knowledge, skills, abilities, at- 
titudes, and the like over the months and years. For example, an effective 
curriculum in arithmetic will provide continuous opportunity for arithmetic 
computations so that these skills will not be developed and then permitted 
to wither away through lack of use. Similarly, a good curriculum will pro- 
vide continuing opportunity to use and apply such a basic concept as 
"transformation of energy" so that it will become part of the student's 
way of thinking about the physical world. 

The criterion of sequence refers to that arrangement of learning experi- 
ences in which each subsequent treatment of a particular concept, skill, 
attitude, or the like is not simply a repetition of former treatments, but 
deals with it more broadly or more deeply or at a higher level than the 
preceding ones. For example, sequence in the treatment of the concept of 
"cooperation" might begin with the idea of cooperation among a small 
intimate group including reference to physical needs, such as cooperation 
among family members in preparing and serving a meal. Subsequent treat- 


ments of cooperation could broaden the groups in which cooperation is 
involved, increase the types of cooperative relationship, and deepen the 
intellectual and emotional significance of cooperation. This would imply 
real sequence rather than mere continuing repetition of the original con- 
cept of cooperation. 

Correspondingly, sequence in the treatment of a skill like "map-read- 
ing" might begin with a simple chart which the pupils drew of their own 
classroom. As they learned how to interpret this chart, subsequent experi- 
ences with map-reading might involve a wide range of kinds of maps and 
a broader use of maps to make many kinds of interpretations. 

The criterion of integration refers to a relationship among the several 
subjects or parts of the curriculum as well as between school experiences 
and those out-of-school, all of which serve to provide mutual re-enforce- 
ment for the important learnings emphasized in each of the several seg- 
ments of the pupil's learning experiences. For example, the criterion of 
integration is involved when the attempt is made to provide opportunities 
in social studies and science for students to use the writing skills they 
are developing in the English class. The use of these skills in other classes 
helps to strengthen and render more meaningful the learnings gained in 
the English class. Likewise, attitudes of enjoyment of music developed in 
the home can be intelligently re-enforced by school experiences with music. 
These re-enforcements through continuity, sequence, and integration are 
very necessary in order to achieve an efficient organization that will provide 
a maximum cumulative effect of the various learning experiences of each 
pupil. 

To plan for an effective organization of learning experiences, it is neces- 
sary to identify the instructional elements to be organized and to develop 
some hypotheses or principles that provide a basis for arranging these ele- 
ments to produce the desired cumulative effect. Thus, in the case of a 
mathematics course it is necessary to identify the elements that are to be 
organized in the mathematics course. These are likely to be of two kinds — 
the first, concepts, that is, basic ideas or notions about mathematical mat- 
ters, and, second, skills, that is, the important intellectual operations that 
are involved in handling quantitative computations and problems. An 
illustration of a mathematical concept might be the idea of place in the 
number system; another might be the concept of continuity or discontinuity 
in a mathematical function. An illustration of a skill might be the skill of 
multiplying simple integers. Each of these is an element which can be 
introduced at a relatively simple level and carried on through increasingly 
complex and broader aspects. Hence, both concepts and skills are elements 
which permit of effective organization. Correspondingly, in a field such as 


social studies, the organizing elements are likely to include: (1) Concepts, 
such as the interdependence of all human beings. This concept can be 
introduced at a relatively simple level to small children as they see the de- 
gree to which they are dependent upon each other for various activities in 
the kindergarten and can be extended to increasingly broader and deeper 
ranges until mature students see the degree to which each citizen and each 
nation is dependent upon citizens and nations far removed from them. 
(2) Skills, such as skill in reading and interpreting social statistics and 
the like. These skills, too, can be introduced at a simple level and carried 
on to more complex operations. (3) Values, such as belief in the dignity 
and worth of every human being. Each of these elements permits of ap- 
propriate organization to provide for continuity and sequence and to relate 
it more integrally with the rest of the student's experience, within and 
without the school. 

In addition to selecting the elements to be organized, it is necessary to 
make some decision about the principles of organization to be followed. 
Thus, in connection with the development of such a skill as addition, the 
principle of organization may be to begin with what is apparently simple 
and move on through what appears to be increasingly complex. This is 
accomplished by beginning with the addition of two one-digit numbers, 
then proceeding to the addition of several one-digit numbers, then to the 
addition of two-, three-, and four-digit numbers, then to fractions, and 
ultimately to negative and imaginary numbers. The acceptance of such an 
organizing principle suggests the way in which this skill of addition is to 
be developed over the years. There is also the problem of the organizing 
principle to relate addition to other aspects of experience, for example, 
relating addition to subtraction, to multiplication, to division, and ulti- 
mately relating addition in mathematics to operations or concepts in other 
fields. Before a rationally planned program of instruction can be developed, 
it is necessary to decide upon elements to be organized and possible prin- 
ciples of organization to be applied. 

Educational measurement can contribute to the process of organizing 
instruction in two ways. In the first place, the development of achievement 
tests for a particular course or instructional program requires an explicit 
statement about the elements of organization and the hypotheses which 
are accepted regarding the principles of organization. These explicit state- 
ments are necessary because the achievement test must be built so as to 
include these elements as aspects of the test. For example, in the case of 
mathematics, if concepts and skills are the two major kinds of elements, 
it is necessary for the achievement test to include items involving concepts 
and items involving skills. It is also necessary to indicate the principles of 


organization so that the test can have varying levels of items, some involv- 
ing concepts at a relatively simple level and others at an increasingly 
broader and deeper level. Similarly, items involving skills will have some 
at a relatively simple level and others at the more complex level, the defini- 
tion of simple and complex demanding some formulation of the organizing 
principles actually used in developing this mathematics course. 

Correspondingly, in the case of a social studies course it is necessary to 
identify the types of elements to be sure that the achievement test properly 
samples each of the types. If the social studies course involves concepts, 
skills, and values, then it is necessary that test exercises test for the under- 
standing of certain concepts, for competence in certain social studies skills, 
and for the kinds of values which the students cherish and accept for them- 
selves. It is also necessary to know the organizing principles for each of 
these elements, so that the test may provide opportunity for the students to 
show their ability to deal with increasingly more mature concepts, skills, 
and values. It is true in the case of organization, as in the case of objectives, 
that although a rationally planned instructional program requires a clear 
understanding of organizing elements and of organizing principles, many 
teachers have not given explicit consideration to these matters because they 
have accepted the traditional organization without examining the basis upon 
which such schemes of organization rest. Hence, it is quite possible for 
teaching to go on without the teacher being conscious of the scheme of 
organization and the elements of organization in the program he is teach- 
ing. The fact that a comprehensive achievement test cannot be constructed 
without some specification of these organizing elements and organizing 
principles stimulates the instructional staff to formulate their conceptions 
of the bases of organization and thus helps to bring problems of organization 
more clearly to the attention of instructors. This is a significant contribu- 
tion, because it is not likely that organization will be made more effective 
unless it is clearly and explicitly recognized by those who are responsible 
for instruction. 

A second way in which educational measurement can contribute to the 
organization of instruction is by providing a means for testing the hypothe- 
ses about organization around which any given instructional program has 
been developed. For example, a certain social studies program was organ- 
ized in the elementary school on the hypothesis that each of the major 
concepts would be introduced by beginning with the life in the family, 
then progressively extending these understandings to the school, the com- 
munity, the state, the nation, and to the world community. On this basis 
such concepts as interdependence, human variability, political power, and 
the like were developed over a six-year period. To appraise the effectiveness 


of this organizing scheme, tests were constructed for these concepts at 
several levels in the sequence. It was found that a number of the concepts 
did not develop very satisfactorily when they were extended in this 
geographic sense from the family to wider and wider circles, ultimately 
reaching the international scene. This brought the organizing principle 
into question and led to its reformulation. The revised organizing prin- 
ciple was based on the idea that certain aspects of life are simpler and 
easier to comprehend than other aspects and, hence, that the concepts would 
be developed first in their relation to these simpler aspects of life and then 
move on to other more complex aspects. The second scheme of organiza- 
tion was also appraised by the use of tests and was found to give somewhat 
better results than those obtained by the earlier scheme. 

As another illustration of the use of educational measurement in making 
an explicit test of hypotheses about organization, we may cite the case of 
the English course which was organized with the idea that integration 
could be achieved by having the English course deal with literary writing 
and asking the teachers of science, social studies, shop, and mathematics 
courses to take responsibility for applying the concepts and skills in writ- 
ing to their own particular fields. This principle of horizontal organiza- 
tion was appraised by giving tests that showed the degree to which stu- 
dents were able to utilize major skills in writing in connection with each 
of the areas of instruction. It was found that this scheme of organization 
was not totally satisfactory. In some cases the concepts developed in the 
literary field were not directly applicable to the writing of scientific papers 
or of papers in other fields, and in other cases the application required more 
understanding of the principles of writing than the teachers of the other 
fields had acquired. These results led to the modification of the scheme of 
organization so as to provide the initial development of writing principles 
and skills in the English class in connection with each of the types of 
writing commonly employed in other courses students were taking. After 
the initial development in the English class, the other classes provided for 
continued practice of these writing skills through periodic papers and re- 
ports. Judging from the test results, this organizing scheme proved to be 
more effective than the one previously used. These illustrations are but two 
of many that could be cited to show the way in which educational measure- 
ment is used to test the organizing principles of the curriculum. 

The foregoing discussion has indicated two ways in which achievement 
testing can contribute to the process of organizing learning experiences. 
In the first place, the development of a valid testing program challenges the 
staff to consider the problem of organization and requires an explicit 


formulation of the organizing elements and the principles of organization. 
Educational measurement also provides a means for testing these organiza- 
tional hypotheses so as to retain those that are effective and to discard or 
reformulate those which do not produce the cumulative effect which good 
organization demands. 

Measurement Aids in the Supervision and Administration of Instruction 

This chapter has thus far dealt with the way in which educational 
measurement contributes directly to the process of instruction. It has been 
noted that achievement testing is in itself one of the major phases of 
instruction and that in addition it helps the formulation of clear-cut educa- 
tional objectives, it provides assistance in the selection of content and 
learning experiences, and it aids in the development of an effective organi- 
zation of learning experiences. Next, the discussion deals with the contri- 
bution educational measurement can make to the supervision .of instruction 
and to its administration. 

Measurement Contributes to the Education of 
Teachers in Service 

Essentially, the supervision of instruction has two major functions. One 
is to provide for the continuing development of the teacher — in-service 
education — and the other, to provide for the coordination of instructional 
efforts. In providing ways by which teachers can i mprove their own kno wl- 
edge an d skill, achTeve m"ent testi n g has t go-important contr ibution s. In 
the first place, to work upon the s election or construction of appropriate 
achievement tests is an important kind of learning activity for the instruc- 
tional staff. It requires a careful formulation of objectives and clear think- 
ing about the meaning of these objectives so that they can be defined 
explicitly in terms of the actual behavior changes desired in the students. 
It also requires careful consideration of the ways in which one's students 
might be expected to display the kind of behavior which instruction aims 
to develop, and this helps to clarify the teacher's instructional tasks and 
keep him from being subject-centered. It has often been maintained that 
teachers should be students of children rather than merely students of 
subject matter. The kind of focus of attention involved in the construction 
and selection of achievement tests is one means of helping teachers to 
study children. It requires continual thought about the potential reactions 
of children, the changes to be desired in their behavior, the ways in which 
these changes can be brought about, the ways in which this changed be- 


havior might be expressed within and without the classroom, and the 
ways in which it could be appraised validly. Surely this is an important 
contribution in the continued education of the instructional staff. 

Another important contribution to the education of the instructional 
staff is provided when the results of achievement tests are used by the 
supervisor and the instructional staff in evaluating the instructional pro- 
gram. When it becomes clear that a particular course brings about some 
of the desired changes in behavior but not others, it raises the question 
as to what is wrong with the course if it does not attain certain of the 
purposes desired. This focuses attention upon possible modification of 
content at those points where the course needs to be improved, and it 
stimulates proposals by teachers for experimentation. As experiments are 
initiated, achievement tests again provide the means for determining 
whether a given experimental program is more effective than the program 
previously followed. Such use of test results stimulates further development 
of understanding and skill on the part of teachers by giving them an effec- 
tive means of investigating instruction. 

The use of a measurement program as a major supervisory device has 
one marked advantage over the sole dependence upon supervisory observa- 
tions of teachers at work and later discussion of the observations. It 
focuses the attention of the teacher and the supervisor upon the students 
and their achievement rather than upon the teacher and his procedures. 
The older methods of supervision, which largely depended upon classroom 
visitation and subsequent conferences during which the supervisor criticized 
the materials and procedures used by the teacher, frequently caused em- 
barrassment, resentment, or defense reactions on the part of the teacher. 
He often felt that the supervisor's criticism represented only personal 
opinion and that the teacher's judgment about particular material or a pro- 
cedure was as valid as the supervisor's. He also felt ill at ease because the 
supervisor was looking at his activities closely and critically. Under these 
conditions neither the instruction observed nor the conference that fol- 
lowed was as satisfactory as might be desired. On the other hand, if the 
focus of supervision is upon the study of test results, the question becomes, 
"How does Johnny (or how does the class) react to this kind of material.'*" 
When defects are identified, the next questions become, "Why is it that 
Johnny (or Frank or the class as a whole) has difficulty with this kind of 
material?" "What can be done about it?" Instead of putting the teacher on 
the defensive, such questions are likely to cause him to be as much inter- 
ested as the supervisor in noting difficulties and figuring out ways of 
overcoming these difficulties. This kind of objective focus of attention in 
supervision is likely to be more conducive to teacher cooperation than is 


supervision which directs its attention to the teacher rather than to the 
students involved. This is a valuable characteristic of the measurement 
program as a supervisory device. 

Measurement Helps in the Coordination of 
Instructional Efforts 

In addition to in-service education another primary function of supervi- 
sion is coordination. Education in school or college involves a complex 
program using a number of different people, teachers of different subjects, 
teachers of different grade levels, supervisory and administrative officers, all 
presumably working toward certain common ends that are involved in in- 
ducting young people into responsible adulthood in our society and 
achieving desirable control of their current problems. Any kind of complex 
program that uses a number of different people all working toward a 
common end is likely to get out of adjustment, to be uncoordinated. The 
function of coordination is to see that these various persons and groups 
make an effective contribution to the common purposes of the program. 

Coordination of a professional group is not as simple as it appears to be 
in coordinating a large number of workers on an assembly line. Many 
decisions, many adjustments in a professional program are made as the 
program develops, and cannot be predicted with precision in advance. 
Instructional supervision is a cooperative activity involving a good deal of 
group planning and sharing of ideas. This kind of coordination requires 
two-way communication rather than directives sent out by the supervisor. 
Essentially instructional coordination involves (1) sharing common pur- 
poses on the part of all staff members, (2) an increased understanding of 
how to attain these common purposes so that each staff member sees his 
own job in relation to these purposes, and (3) on the part of each staff 
member, an increased understanding of himself and how to use his particu- 
lar competence in the program to attain the common purposes. 

Educational measurement contributes to coordination in several ways. 
Because an achievement testing program must begin with agreement upon 
objectives and a clear formulation and definition of them, this process in 
itself is one way of getting common purposes shared by all staff members 
and having these common purposes clearly understood. Furthermore, as the 
content and learning experiences are appraised by use of tests, each 
teacher gets a clearer view of the kinds of materials and procedures used by 
other teachers and their effectiveness. As the organization of learning ex- 
periences is appraised by testing, each teacher gets a clearer view of the 
kind of contribution each field and grade is making to the common 
purposes of the school. The study of test results to see what changes take 


place in students in relation to the different courses they are taking pro- 
vides a way for the teacher to see his own job in relation to the work of 
others and both in relation to the objectives of the school more clearly 
than is provided by general statements of the expected division of labor. 
Such study also makes some contribution to the teacher's understanding of 
his own competence and how it can contribute to the program at hand. Test 
results may show that teacher A has brought about marked changes in 
objectives 1 and 3, whereas teacher B has been more effective in connec- 
tion with objectives 2 and 4. As these results are studied and questions 
are raised about their meaning, it becomes possible for the supervisor to 
indicate more clearly the kind of contribution which each teacher is best 
prepared to make and to suggest ways in which his own particular compe- 
tence can be used even more effectively in helping students develop in 
desired ways. The important thing about the measurement program is that 
it focuses attention upon objectives and their attainment, which is the 
fundamental basis for coordination, that is, the relation of means to the 
essential ends of education. 

Measurement Helps in Administration 

Educational measurement can make a contribution to administration as 
well as to instruction and supervision. Administration has eight major 
functions: (1) the provision of personnel to carry on the instructional 
program, (2) the organization of the staff and effective relations among 
members of the staff and between members of the staff and appropriate 
groups outside the school, (3) the development of policy and plans for 
the educational program, (4) the provision of supplies, equipment, build- 
ings, and the like that are needed for the work, (5) the creation of condi- 
tions requisite for effective work, (6) the provision of funds to carry on 
the work, (7) the evaluation of the work as a whole and its various parts, 
and (8) the interpretation of the work and needs to the public and to 
special groups whose understanding is essential for the adequate support 
and development of the educational program. 

In examining this list of functions, it appears most obvious that educa- 
tional measurement contributes to the evaluation of the work of the school 
and its various parts. Measurement also contributes to several other ad- 
ministrative functions. In selecting personnel, scores on achievement tests 
represent one index of competence. Furthermore, the quality of work the in- 
structor does is partly indicated by the measured progress made by his stu- 
dents. Hence, educational measurement provides two, among several, 
indices helpful in selecting and promoting the instructor. The results of 
achievement testing make not more than a minor contribution in appraising 


the effectiveness of staff organization. However, in the development of 
policy and the planning of program the results of achievement tests help 
to indicate particular points of emphasis needed in a particular school, and 
the process of planning for achievement tests provides, as has been pointed 
out earlier, an important basis for getting common understanding of policy 
and common agreement about the ends of the educational program. The 
results of achievement tests also provide an important basis for identifying 
instructional problems needing attack and for determining the effectiveness 
of efforts at instructional improvements. 

The results of achievement tests provide helpful evidence that can be 
used in the selection of effective instructional materials and facilities. 
Buildings have too often been built primarily as monuments to the archi- 
tects. Classrooms have commonly been planned in terms of convenience of 
construction or the patterns and traditions of the building industry. In- 
structional facilities are often provided on the basis of advertising claims 
and the whims of purchasing departments. All of these materials are pri- 
marily means for bringing about the objectives of education and can be 
selected more wisely and more validly on the basis of evidence as to their 
effectiveness in promoting the objectives of education. Carefully planned 
programs of achievement testing can provide data regarding the probable 
contribution of various kinds of buildings, facilities, supplies, and equip- 
ment that are more dependable than the usual bases for making decisions 
on these matters. 

The results of a carefully planned program of educational measurement 
are also useful in appraising the conditions set up to promote staff growth 
and development. One important criterion of staff development and staff 
efficiency lies in the educational results the staff is achieving. These results 
can be inferred from the results of a carefully planned program of testing 
of the students' progress. 

Measurement makes no direct contribution to securing funds except as 
it contributes to public relations. The interpretation of the work and needs 
of the school to the public and special groups whose cooperation is essen- 
tial can be much more validly made when evidence is available as to the 
results being obtained and the difficulties being encountered than when the 
content of public relations is based on enrollments alone or on matters 
which do not directly reflect the desired results of the educational program. 
There was a time when American education was taken on faith, when it 
was assumed by the public that mere enrollment of the child in school was 
a good thing. Increasingly, however, the lack of evidence regarding the 
school's effectiveness has made it possible for public criticism of education 
frequently to go unchallenged. If the school is to meet criticism intelli- 


gently, if it is to inform the public regarding its real achievements and its 
real difficulties, it must base its reporting not only upon a description of the 
educational program and upon the statistics of the students enrolled, but 
it must also indicate the kinds of results being achieved. It must have the 
data from programs of educational measurement to show clearly what the 
school is now accomplishing, where it is having difficulty, and what kinds of 
further support it needs to do a better job. 

It can thus be seen that educational measurement has a contribution to 
make to administration at several important points in addition to the ob- 
vious contribution that it makes in providing the basis for a careful ap- 
praisal of the work of the school. As administrators recognize the many 
values accruing from a carefully planned testing program, educational 
measurement can make its potential contribution more widely effective. 

Measurement Can Help Instruction More When Certain 
Conditions Are Met 

Many potential contributions that educational measurement can make 
to the improvement of the content, organization, supervision, and admin- 
istration of instruction have been suggested in the previous sections of this 
chapter. At present these potential contributions are rarely achieved in full 
measure. In order to realize them, certain important conditions are neces- 
sary in organizing and developing an achievement testing program. Among 
these, four are of particular importance. First, it is essential that the out-
comes selected for testing should be the important objectives of the instruc- 
tional program. For example, if the instructional program emphasizes pri- 
marily the development of certain intellectual skills and the acquisition 
of some major concepts, whereas the testing program concentrates upon 
the memorization of isolated and specific information, the testing program 
is either disregarded as being relatively unimportant or it tends to deflect 
the educational program from its primary objectives. The testing program 
must, therefore, provide a measure of the degree to which the important 
objectives of the instructional program are being attained. 

In the second place, it is essential that achievement testing be planned 
and developed as an integral part of the program of curriculum and in- 
struction. The tendency in some schools to set up a test division separate 
from the division of instruction is unwise. The tests should be based upon 
the objectives of instruction; the results should aid in the improvement of 
instruction. Hence, only insofar as tests are selected or constructed in terms 
of the instructional program and the results are immediately available for 
instructional planning can their greatest values be obtained. If achievement 
testing is not an integral part of curriculum and instruction, it is likely 


to be viewed with suspicion, or to go off in directions other than those 
planned in the instructional program, or to operate as a parallel activity 
but not one influencing instructional development. 

A third condition for realizing the potential contributions of measure- 
ment to instruction is that the achievement testing program be guided and 
controlled by those responsible for the guidance and control of the instruc- 
tional program. Because achievement tests provide such a concrete indica-
tion of objectives, because they have great influence upon students and 
teachers in directing their efforts, it is possible for the achievement tests 
to be more influential in the actual learning that goes on in the school than 
the curriculum outline or any other materials prepared by the instructional 
staff. Hence, if the testing program is not under the direction and control 
of those responsible for the instructional program, measurement will 
operate as a powerful influence in directing learning in the school which 
is not in harmony at all points with the instructional program of the school. 

A fourth condition for realizing the potential contributions of measure- 
ment to instruction is to encourage teachers to take part in the construction 
of all types of tests, since many standard tests and teacher-made testing 
instruments constructed for one school system are not applicable without 
adaptations to the curriculum of another school system. It is necessary for 
interested persons and teacher committees to work upon the construction of 
evaluation techniques designed to meet the special needs of their own 
school curriculum and courses of study. These teacher committees may 
work informally with the school's division of tests and measurements. For 
example, a committee of five chemistry teachers might work upon a 
battery of tests designed to measure various aspects of thinking in the 
subject of chemistry. They could develop subtests such as obtaining facts 
about chemistry from various sources, interpretation of facts and data in 
chemistry, and the application of principles of chemistry to new situations. 
Perhaps a committee of mathematics teachers working cooperatively with 
the school's division of tests and measurements could develop tests for the 
objectives of mathematics. Such a committee is strategically placed to
devise a test adapted to the school situation of which its members are a
part. The committee of mathematics teachers might embark upon
some of the more challenging test areas, for example, understanding 
quantitative relationships, critical thinking in mathematics, and interests 
and attitudes related to mathematics. 

Many of the present inadequacies of instruction are due to the common 
tradition in many schools of making no significant changes in the instruc- 
tional programs and procedures over the years, or the equally unintelligent 
practice in many other schools of making changes on the basis of current 


fads or the personal whims of the staff. The responsibilities expected of 
educational institutions in our times are wholly impossible of attainment 
unless great improvements are made in the effectiveness of instruction. 
Educational measurement can have a profound influence in the improve- 
ment of instruction; but to do so, it must be viewed as an integral part of 
instruction, its planning must go hand in hand with instructional planning, 
and the results must be used continuously to guide the planning and de- 
velopment of the curriculum. 

Selected References 

1. Administrative Officers of Public and Private Schools, Proceedings of the 
Ninth Annual Conference of. Evaluating the Work of the School. Edited by William C. 
Reavis. Chicago: University of Chicago Press, 1940. 

2. Brown, Clara M. Evaluation and Investigation in Home Economics. New York:
Appleton-Century-Crofts, 1941. 

3. Clarke, H. Harrison. The Application of Measurement to Health and Physical 
Education. New York: Prentice-Hall, 1945. 

4. Cooperation in General Education: A Final Report of the Executive Committee of 
the Cooperative Study in General Education. Washington: American Council on 
Education, 1947. 

5. Douglass, H. R. (ed.). The High School Curriculum. New York: Ronald Press, 

6. Dunkel, Harold B. General Education in the Humanities. Washington: American 
Council on Education, 1947. 

7. Greene, Harry A.; Jorgensen, Albert N.; and Gerberich, J. Raymond. Measure-
ment and Evaluation in the Elementary School. New York: Longmans, Green, 1942.

8. Greene, Harry A.; Jorgensen, Albert N.; and Gerberich, J. Raymond. Measurement
and Evaluation in the Secondary School. New York: Longmans, Green, 1943.

9. Hawkes, Herbert E.; Lindquist, E. F.; and Mann, C. R. The Construction and
Use of Achievement Examinations. Boston: Houghton Mifflin, 1936.

10. Horton, Clark W. Achievement Tests in Relation to Teaching Objectives in Gen-
eral College Botany. Botanical Society of America, 1939. 

11. Leonard, J. Paul. Developing the Secondary School Curriculum. New York: Rine- 
hart & Co., 1947. Chap. 7, "Evaluating the School and the Pupil," pp. 209-44;
chap. 15, "Evaluating Pupil Learning," pp. 489-512.

12. Levi, Albert W. General Education in the Social Sciences. Washington: American 
Council on Education, 1948. 

13. McBroom, M. E. Educational Measurements in the Elementary School. New York: 
McGraw-Hill Book Co., 1939. 

14. National Society for the Study of Education. Forty-fourth Yearbook, Part I:
American Education in the Postwar Period. Chicago: University of Chicago Press, 1945.
Especially the chapter "General Techniques of Curriculum Planning."

15. National Society for the Study of Education. Forty-sixth Yearbook, Part I:
Science Education in American Schools. Chicago: University of Chicago Press, 1947.

16. National Society for the Study of Education, Committee on the Measurement
of Understanding. William A. Brownell, chairman. Forty-fifth Yearbook, Part I:
The Measurement of Understanding. Chicago: University of Chicago Press, 1946.

17. New York Public Schools, Bureau of Reference, Research and Statistics, 
Division of Tests and Measurements. Determining Readiness for Reading. New York: 

18. Orleans, J. S. Measurement in Education. New York: T. Nelson & Sons, 1937. 
Chap. 1, pp. 16-24. 


19. Purnell, Russell T., and Davis, Robert A. Directing Learning by Teacher-Made
Tests. Boulder: Extension Division, University of Colorado, 1939.

20. Remmers, H. H., and Gage, N. L. Educational Measurement and Evaluation. New 
York: Harper & Bros., 1943. 

21. Ross, C. C. Measurement in Today's Schools. 2nd ed. New York: Prentice-Hall, 1947. 

22. Smith, Eugene R.; Tyler, Ralph W.; and the Evaluation Staff [of the 
Commission on the Relation of School and College of the Progressive Education 
Association]. Appraising and Recording Student Progress. New York: Harper & Bros., 

23. Taba, Hilda. "The Functions of Evaluation," reprint from Childhood Education, 
journal of the Association for Childhood Education, Feb. 1939. 

24. Thayer, V. T.; Zachry, Caroline B.; and Kotinsky, Ruth, for the Commission 
on Secondary School Curriculum. Reorganizing Secondary Education. ("Progressive 
Education Association Publications.") New York: Appleton-Century-Crofts, 1938. 

25. Traxler, Arthur E. The Nature and Use of Reading Tests. New York: Educational 
Records Bureau, 1941. 

26. Traxler, Arthur E. The Use of Test Results in Diagnosis in the Tool Subjects.
New York: Educational Records Bureau, 1942.

27. Troyer, Maurice, and Pace, C. Robert, for the Commission on Teacher Educa- 
tion. Evaluation in Teacher Education. Washington: American Council on Education, 

3. The Functions of Measurement
in Counseling 

By John G. Darley 

University of Minnesota

Gordon V. Anderson 
The University of Texas 

Collaborators: Margaret Bennett, Pasadena Public Schools; A. J. 
Brumbaugh, Shimer College; Galen A. Jones, U. S. Office of Education;
Robert H. Mathewson, Harvard University; E. K. Strong, Stanford Uni- 
versity; Donald E. Super, Teachers College, Columbia University 

The Process of Counseling 

Counseling is the process in which information about the 
individual and about his environment is organized and reviewed in such 
a way as to aid him in reaching workable solutions to a variety of adjust- 
ment problems in the normal range of behavior. In this sense counseling is 
a specialized phase of a personnel service or guidance program. It is im- 
portant also to realize that counseling covers a normal range of planning 
and of adjustment problems, albeit this range is rather broad. If we move 
beyond it in the direction of extreme deviation, we enter the area of 
medical psychology and psychiatry, involving serious mental illness with 
or without organic involvement. If we move beyond this normal range in 
the other direction, we enter the area of routine information-giving and 
routine orientation, as exemplified in college catalogues, minimum admis- 
sion requirements, and application procedures. But problems beyond the 
level of mere information-giving and just short of serious mental illness or 
social malfunction are persistent, varied, often complex, and often of long duration.

There is, for example, the problem of vocational planning which faces 
nearly all individuals in our culture. The student, the family, and the 
educational system will all spend considerable time and effort in arriving at 
a workable solution to this problem. To a certain extent a definite voca- 
tional plan tends to determine the student's educational plans and the par- 
ticular curriculum he will follow; conversely, the choice of an interesting 
curriculum not infrequently determines the subsequent choice of a voca- 



tion. But even when the problem of curricular choice is seemingly solved
by the vocational decision, additional educational problems involving 
study habits, reading skills, patterns of ability or achievement, discrepancies 
between achievement and aspiration, and motivation may remain for solu- 
tion. In the case of a vocationally undecided student, the educational 
planning itself may become a major task if an effort is made to provide try- 
out experiences in a variety of vocational fields. 

There is also a considerable range of social and personal adjustment 
problems with which students are faced in the later years of adolescence 
and the first period of young adulthood. The transition from high school 
to college often involves new independence, new social relationships, and 
new social competition. Family relations must be restructured and different 
social mores must be assimilated. 

Since the whole organism is involved in this continuous process of 
adjustment and readjustment, emotional disturbances, neurotic behavior 
patterns, and similar maladaptive processes may develop in the individual. 

These, then, are some of the problems which students may bring to 
counselors. For the purposes of this discussion, we are concerned with the 
problems of students in the later high school years and the college years. 
Thus, the counselor is here considered a specialist in the diagnosis and 
treatment of the adjustment problems presented by young people in our 
highly structured educational system. He is one member of the large group 
of personal service practitioners that includes doctors, psychiatrists, social 
workers, visiting teachers, school advisers, and child guidance specialists. 

As is true of each of these specialists, the environment in which the 
counselor works partially conditions his techniques, his activities, and his 
objectives. Higher education ordinarily involves fairly rigid curricular 
patterns, for which societal and often legal compulsions to conformity 
exist. Graduation from an approved medical school is an example of a 
legal compulsion to conformity before the individual may enter the speci- 
fied profession, whereas employers' insistence on graduation from a school 
of journalism or a school of business administration is an example of a 
societal compulsion to conformity. 

Furthermore, those families that send their children on to higher educa- 
tion seem often to view the process as a specialized form of unemployment 
insurance or social sine qua non; it is a necessary finishing process which, 
in the opinion of parents, can be completed if only their children work 
hard enough. The fact that approximately six out of ten college entrants 
fail to survive to graduation gives rise to disappointments and, for parents,
frustrations that may have a serious effect on the personalities of their children.


From the standpoint of the college faculty, education too often involves 
primarily the accumulation of credits and grades in an orderly sequence; 
from this accomplishment in this sequence certain inferences are made 
regarding intellectual growth, maturity, and subject-matter mastery. 

The counselor, as a staff member of the educational system, works 
within the boundaries laid down by these prevailing attitudes, demands, 
and institutional mores. He may not agree entirely with them; he may 
recognize their limitations. But this is essentially the institutional frame- 
work within which he works, just as certain social workers must deal with 
the misfits in a highly competitive economic structure, or just as certain 
doctors must deal with the industrial hygiene problems of a particular 
industry or area. With respect to the institutional context in which coun- 
seling is conducted, it should be remembered that the counseling process 
itself, and the related activity of measurement, will be affected by the 
dominating institutional purpose. If the controlling aim is selection, recom- 
mendation, or advisement for a particular institutional objective, the pro- 
cedures of counseling and measurement may be different than if the 
governing purpose is more general, more concerned with individual growth 
and development. 

By virtue of his training and clinical experience, however, the counselor 
tends to be actively concerned with more aspects of the whole individual's 
behavior than does the educational system itself. He has before him always, 
however dimly seen, a multidimensional criterion of the adjustment of the 
individual student. He attempts to predict, however accurately, many things 
to come for each student. He is concerned not only with success as measured 
by grades earned and requirements completed, but also with success as 
measured by individual satisfaction and staying power in occupational 
tasks. He attempts to forecast not only the outcomes of a remedial reading 
program, but also the effective social adjustment of the individual student, 
as influenced by social experiences and growth during the college years. He 
must deal not only with the immediate emotional conflict, but also with 
speculations regarding the effect of this conflict on the emotional growth 
of the individual student. 

The counselor is essentially a clinician who, as a rule, deals with one 
student at a time, sees that student against the background of the competi- 
tive demands to be met not only in the particular institution but also in the 
life situations beyond the institution, and employs with that student the 
most effective diagnosis and therapeutic skills he possesses. Because he 
carries on this activity in an educational institution, one outcome of his 
work should be that the student arrives in the classroom situation relatively 
free from distractions that interfere with learning and more positively 
oriented toward learning as a means to achieve adult adjustment. 


The uses of psychological measurement in counseling, as defined here,
are those which relate to individual problems of adjustment, orientation,
and development that are brought to the counselors for help, and those
which relate to group problems of growth, achievement, educational selec-
tion, and classification of students that occur in all institutions of learning.
These uses may be summarized as follows: 

1. The objective appraisal of personality for better self-understanding 
and self -direction on the part of the individual himself. 

2. The accurate comparison of individual performance with the per- 
formance of others for the purposes of selection, recommendation, 
and self-understanding. 

3. Improved basis of prediction as to likelihood of success in any activity 
in which prospective performance can be measured and compared. 

4. Evaluation of personal characteristics in relation to characteristics 
required for educational and occupational performance. 

5. Evaluation of achievement and growth — individual and group. 

6. Disclosure of capacity and potentiality as well as the diagnosis of 
mental disabilities, deficiencies, and aberrations. 

The applications of measurement stem from the basic need for objective 
comparative data upon individual behavior, subject as little as possible to 
the vagaries of subjective surmise and interpretation.

The Function of Measurement in the Counseling Process 

Psychological measurement functions in this counseling process as a 
means to an end: to help identify individual strengths and weaknesses 
within an individual and between individuals relative to competitive levels; 
to provide insight and understanding for the student and the counselor; to 
structure diagnostic descriptions either in cause-effect terms or in terms of
covariation of behavioral elements; and to permit more accurate predictions 
than would otherwise be possible. One of the continuing emphases of 
American psychology is to be found in the field of psychometrics broadly
conceived. This has affected counseling programs to a great extent. Through
the appraisal of human behavior in its measurable elements and the inter- 
relation and predictive power of the resultant data, diagnostic clues are 
provided for more accurate and meaningful case work. 

It should, however, be pointed out that the approach to counseling 
through psychometrics alone has been misleading, for the counselor's em- 
phasis upon the rationale of measurement and prediction, and the rational 
nature of the processes through which he gains insight into a counselee's 
situation, may lead him to lose sight of the fact that what are to him objec- 
tive data are often highly emotional matters to the counselee. As many 


eclectic counselors have long known, effective vocational and educational
counseling requires judicious variation of rational, or information-giving, 
and emotional, or attitude-clarifying, approaches. 

Prediction and diagnostic description on the basis of quantitative data 
are, however, characteristic of many aspects of our social structure. Large- 
scale industrial and military selection programs can be made more efficient 
by the development of good predictions of subsequent achievement. Medi- 
cal science rests heavily on the diagnostic clues and predictive meaning of 
physiological measurements. Psychiatry and social work profit in part from 
quantifiable data on individual behavior. In many situations the choice of 
a particular method of helping an individual to reach a state of satisfactory 
adjustment is dependent upon the particular problem judged to exist, or the 
particular diagnosis that has been made. 

That some counselors are hesitant to exercise the appraisal function 
reflects no doubt upon the principle involved, but rather upon the fact that
our present knowledge of personality appraisal is rudimentary. This does 
not mean, however, that we have to give up our attempts to assay and 
appraise personality either on the ground of individual uniqueness (which 
no one will contest) or on the ground that ultimate complete knowledge 
may be impossible. Moreover, even in the present elementary state of 
knowledge, we may employ such knowledge for what it is worth in coun- 
seling situations for the enlightenment of individual clients and for the 
sake of improving our practices through the discipline of case experience. 

The individual who receives such information from a counselor may 
learn something of value about himself not otherwise ascertainable which 
in no wise would seem to detract from his freedom of personality. 

Thus, the "appraisal of the individual" and the conveyance of related 
personal information is important in the counseling process to the extent 
that our findings have validity. Meanwhile, we retain this function, even 
in its present state of imperfection, because it is one of the basic reasons 
why individuals come to counselors for assistance. Insofar as it can con- 
tribute to the improvement of individual and social adjustment, we em- 
ploy it. 

In these respects the appraisal of individuals in the field of counseling 
may be like that undertaken in the early days of medicine. That develop- 
ments will occur far beyond what we now know is a foregone conclusion. 
That individuals will continue to come to counselors for such appraisals is 
also a foregone conclusion. They will receive direct and explicit help, but 
this will not necessarily be "determinative" nor "directive." 

However, even though advances in psychometric techniques have been 
of great help to counselors in understanding human behavior, counseling 


is considerably more than a matter of measurement alone. The student must 
be motivated to seek counseling help, in the same way that the patient must 
want the doctor's help, before either counseling or medical care can be 
fully effective. And the counselor must be more than a test interpreter. 
Counselors, and other personnel workers, must recognize at all times that 
measurement provides only a limited view of certain aspects of personality, 
the totality of which constitutes a dynamic, unique whole. This means, 
among other things, that test scores cannot be utilized effectively except 
in relation to other known, or surmised, factors in the total problem situa- 
tion. They can never be rightly interpreted except in context. 

One of the significant features distinguishing one mode of counseling 
from another in our current stage of development is the degree of emphasis 
which the counselor places upon the measured, as against the nonmeasured, 
aspects of the problem situation as appraised jointly by himself and the 
client. Some counselors tend to put considerably more weight upon test 
scores than others. 

It is pointed out by those who tend to minimize the value of measure- 
ment data that the leeway afforded the individual is sufficiently wide in 
many actual performance situations and the influence of nonmeasured varia- 
bles is so high (as, for example, drive or motivation) that the characteris- 
tics which we are able to measure may not mean as much in life activity as 
the nonmeasurable factors. 

In recent years Rogers and his students (18, pp. 249-52; 19) have ably
presented this viewpoint on counseling in which the psychometric aspects 
of the total process are minimized. The emphasis is placed heavily on the 
therapeutic aspects of counseling and on a particular method of therapy. 
This viewpoint represents a healthy reaction against too great concern with 
measurement alone. As has been noted in the literature, the psychometric 
approach to counseling tends to create a passive mental-set in the counselee:
expecting tests, he also expects prescriptive answers to his problems.
The counselor then has the difficult task of shifting back to the counselee 
responsibility for analyzing his situation and for making decisions. But if, 
on the contrary, the counselor begins somewhat nondirectively and uses tests 
incidentally to the counseling procedure when they seem likely to provide 
data needed by the counselee, it seems only natural to the counselee that he 
should himself evaluate the results, checking his interpretations against 
those of the counselor, and arriving at his own conclusions with the coun- 
selor's guidance. This suggests that the use of tests can be reconciled with 
counseling and with therapy. And it must not be overlooked that some of 
the debate emerging from Rogers' work relates to differences in nomen- 
clature and methods of therapy, or to the relative importance of diagnostic 


descriptions as determiners of the kind of therapy that may be effective. 
Such differences can be resolved by research and clinical studies. It is quite
possible that the need for the kind of precise information supplied by 
measurement will emerge sooner or later in the majority of counseling 
relationships, whether the counselor takes the initiative in getting this 
information or whether the student comes to the point of requesting tests 
as sources of self-understanding. 

Significant Measurennents in the Counseling Process 

In the design of the ordinary prediction experiment, an attempt is made 
to use predictors showing relatively low intercorrelations among themselves 
and relatively high correlations with some criterion of later success. By 
appropriate statistical treatment, the contribution of each separate predictor 
can be maximized and weighted into a multiple regression equation that
gives the best prediction of the criterion measure. This is essentially an 
actuarial procedure by which the experimenter hopes to improve, but cannot 
make perfect, his selection for success in the criterion task. 
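The prediction experiment described above can be sketched in a few lines of Python. All of the data here are fabricated for illustration, and ordinary least squares stands in for the "appropriate statistical treatment" the text leaves unspecified; the text does not prescribe any particular computation.

```python
import numpy as np

# Hypothetical sample: three predictors (say, scholastic aptitude,
# achievement, and a special-aptitude score) for 200 students, built to
# have low intercorrelations, plus a criterion of later success that each
# predictor is correlated with.
rng = np.random.default_rng(0)
n = 200
predictors = rng.normal(size=(n, 3))
true_weights = np.array([0.5, 0.3, 0.2])
criterion = predictors @ true_weights + rng.normal(scale=0.5, size=n)

# Fit the multiple regression equation by least squares:
# an intercept plus one weight per predictor.
X = np.column_stack([np.ones(n), predictors])
weights, *_ = np.linalg.lstsq(X, criterion, rcond=None)

# Predicted criterion scores for the sample; the multiple correlation R
# shows that prediction is improved but, as the text notes, not perfect.
predicted = X @ weights
R = np.corrcoef(predicted, criterion)[0, 1]
print("regression weights:", np.round(weights, 2))
print("multiple R:", round(R, 2))
```

Because the criterion contains irreducible error (the unmeasured factors the chapter discusses), R stays well below 1.0 even when the weights are recovered accurately.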

Somewhere in the total process of counseling students it is often neces- 
sary to get an answer to a question about the student's chances of success 
in a given task in curriculum or vocation. To the extent that this student 
is a member of the sample upon which the multiple regression equation 
was established, an actuarial, or probability, answer can be given to this 
question, and the answer derives from test performance. But there are, in 
addition, factors of maturity, motivation, emotional stability, financial sup- 
port, and personal adjustment, no one of which is ordinarily itemized in 
the regression equation and any one of which may determine the success or 
failure of the individual student. Thus, the counselor finds himself "shad- 
ing" the actuarial prediction one way or the other, depending upon his 
assessment of the import of these other factors. The more thoroughly he 
understands the student, the more conscious he may be of this shading 
process. No claim is made that the only end of counseling is prediction; 
this discussion merely emphasizes that prediction is one part of the counsel- 
ing process, whether it involves prediction of grades, job success, personal 
adjustment, or marital adjustment. Prediction may be done from rigid appli- 
cation of regression weights or from the exhaustive case study implicit in 
Allport's (1) discussion of personality, wherein so much is learned about
one individual that he becomes a unique category not covered by the usual 
actuarial formula. 
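The "actuarial, or probability, answer" mentioned above can be illustrated as follows. Given fitted regression weights and a standard error of estimate from a prior prediction experiment, the equation yields a predicted criterion score for one student and, under a normal-error assumption, a chance of clearing a success threshold. Every number and name below is hypothetical, and the counselor's "shading" is precisely what this computation leaves out.

```python
import math

# Hypothetical weights from a prior prediction experiment: an intercept
# plus weights for aptitude, achievement, and study-habits scores.
intercept = 0.4
weights = [0.5, 0.3, 0.2]
std_error_of_estimate = 0.5   # scatter of actual about predicted criterion

def predicted_criterion(scores):
    """Apply the multiple regression equation to one student's scores."""
    return intercept + sum(w * s for w, s in zip(weights, scores))

def chance_of_success(scores, threshold):
    """Probability that the student's actual criterion exceeds the
    threshold, assuming normally distributed errors about the
    regression line (the normal CDF via math.erf)."""
    z = (threshold - predicted_criterion(scores)) / std_error_of_estimate
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

student = [1.0, 0.5, -0.2]    # standardized scores, illustrative only
p = chance_of_success(student, threshold=0.7)
print(f"predicted criterion: {predicted_criterion(student):.2f}")
print(f"chance of exceeding threshold: {p:.0%}")
```

The output is a probability, not a verdict: factors such as motivation, emotional stability, and financial support lie outside the equation, which is why the counselor shades the actuarial figure up or down.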

For general counseling purposes, a predictive and diagnostic structuring 
emerges when the following kinds of data are available: 


General scholastic ability

Differential measures of achievement

Evidence of special aptitudes or disabilities

Personality structure and dynamics, including attitudes and beliefs

Socioeconomic and cultural derivation and relations

Health and physical attributes

Such items of information meet the general demands of the prediction
experiment; each has some relation to subsequent adjustment or success; 
all are intercorrelated to some degree but not to a high degree; some may 
act to suppress or depress the effectiveness of other aspects of the individual. 
Measurement devices of varying degrees of accuracy and validity exist in 
each of these areas of behavior. 

It is, furthermore, a primary thesis of counseling — and more broadly 
conceived, of psychology itself — that the key to the process of adjustment 
demanded by life situations in our culture is to be found in the continuous 
interplay, or balance, or patterning and utilization of these behavioral ele- 
ments with situational factors. Within broad limits probably the majority
of people achieve an integration or adaptive balance that permits orderly 
adjustments to life demands. In extreme cases maladaptive balances are 
characteristic and lead to extreme medico-legal solutions. There may be 
transient periods of disintegration and maladaptation requiring less pro- 
longed forms of therapy or relief. And there may be situations in which 
the individual needs only a precise item of information about himself or 
his opportunities in order to move on quickly to a solution of a particular
problem.

With regard to the counseling of students, it is illuminating to see the 
kinds of questions which may be answered to some degree by assessment 
of the behavioral elements cited above. 

1. Questions regarding vocational planning. From the limited research 
data available, it is possible to conceive of families of occupations into 
which the thousands of payroll titles can be grouped. Jobs within these 
broad families demand similar patterns of ability, achievement, aptitude, 
personality, and interests. The measurement of these behavioral elements, 
therefore, can lead to rough predictions of the job families in which greater 
or lesser success and satisfaction may be expected. At the college level the 
choice of a vocation may automatically carry with it the choice of an estab- 
lished curriculum, and the prediction problem may then be reduced to the 
prediction of success in such a curriculum. Similarly, success and consequent 
satisfaction in a given curriculum may lead to choice of a related occupation
as the means of continuing the experience of success and satisfaction.

2. Questions regarding underachievement. Many students fail to live up
to their expectations in college work. The possible causal or conditional 
factors associated with this problem are numerous and of varying degrees 
of complexity. At the level of identification (diagnosis), measurement of 
differential reading skills, of personality characteristics, of differential levels 
of high school and college achievement may all provide leads or clues to 
the source of the difficulty and thus point up effective ameliorative or 
curative opportunities. 

3. Questions regarding personal development and adjustment. In the 
normal evolution of a counseling relationship, students are often eager to 
explore their problems of personality, either with regard to difficult situa- 
tions or difficult reaction patterns within themselves. Under such conditions 
test methods of assessing personality structure and dynamics may serve a 
useful purpose in isolating or defining the area of concern to the student, 
when combined with appropriate interview techniques in which diagnosis 
and therapy proceed simultaneously and reciprocally. 

4. Questions regarding motivation and interests. Even in those cases 
where differential educational or vocational prediction is clear-cut according 
to the usual criteria of success, such as grades or earning capacity, the part 
played by the individual's own interests and desires looms large as a final 
determinant of his behavior. Again, measurement devices are useful in 
identifying and classifying some of these comparative interests and motiva- 
tional factors. 

As has been pointed out earlier, the collection and use of psychometric 
data are only parts of the counseling process. Skillful identification of stu- 
dent problems and effective therapy for them do not follow automatically. 
But it is important to recognize that counseling, like so many other life 
situations, involves choice decisions from among various alternative plans 
of action open to the individual; these choices, in turn, are based on judg- 
ments regarding the behavioral factors related to successful outcomes of 
choice; and these judgments of behavior must be accurately made. Tests 
are valuable in the extent to which they improve the accuracy of inescapable 
judgments. The extent to which they do indeed improve the counselee's 
and not just the counselor's judgments depends on the readiness of the 
client to accept the insights of others or the verdicts of experience, and 
upon the counselor's skill in helping the client to find out what he could 
more easily, but less effectively, be told. This sequence of events is equally 
characteristic of the various schools of counseling which appear, in the 
literature, to represent such divergent viewpoints. 


The Research Problems in Human Adjustment 

It was suggested earlier that continuing adjustment is a function of the 
patterning or interplay of various aspects of behavior with situational de- 
mands. There is one additional principle that must also be stated: human 
adjustment follows from orderly processes of growth or development of 
the personality as a whole and its components. Measurement may reflect 
or "spot" these orderly growth processes as they unfold, and may indicate 
deviate or atypical development as well. It is appropriate to specify four 
properties of measurement devices that enhance their value as aids in mak- 
ing judgments or diagnoses or predictions in the counseling situation, in 
terms of understanding orderly growth. 

There is first the property of reliability or accuracy. The current methods 
of test construction maximize the precision and consistency with which 
behavior can be measured. In counseling situations, as in many other situa- 
tions, this increased precision is a necessary antidote to tendencies toward 
over- or under-estimation of relevant behavior by either the counselor or 
the student. 

There is second the property of validity or meaning or predictive power. 
In spite of all the difficulties of locating adequate criteria of vocational 
success, or educational success, or personal adjustment, good psychological 
tests carry some forecasting power for things to come in the life of the in- 
dividual student, and this predictive value is a balance wheel again in the 
wishful thinking and perennial optimism with which students approach 
many long-range decisions. 

There is third the property of economy of effort. As short, standardized 
samples of behavior, tests can often supply in a short time and at relatively 
low cost a basis of judgment-making that is a practical substitute for trial- 
and-error decisions in our culture. Even if students had endless time and 
resources to try out a variety of plans aimed at educational, vocational, and 
personal adjustment, they might still not arrive at a satisfactory end result. 
Our culture has increased the complexity and variety of choices open 
to students to such an extent that more economical aids to judgment-making 
are requisite. 

The final property of tests is found in their normative aspect, or their 
indication of an individual's standing relative to others of similar back- 
ground, experience, or developmental status. This property is related not 
only to normal growth processes, but also to the extent of deviation from 
normal growth or status as in certain types of personality measurement. 
Other chapters of this text will discuss in detail all these properties of 
tests; they are mentioned here in the context of counseling because they 


apply with special cogency to that definition of counseling in which infor- 
mation about the student is so organized as to provide an improved basis 
for workable solutions to his own problems. 

Let us return now to measurement in relation to basic research problems 
of human adjustment. There are for counseling purposes four fundamental 
aspects of human adjustment, defined as the interplay of behavior resulting 
from orderly growth, that must be considered. To each of these, testing has 
provided both theoretical and clinical insights. 

First, the counselor must have some knowledge of the origin, differential 
rates of growth, and extent of modifiability of human behavior. Without 
becoming involved here in the heredity versus environment conflict, it is 
sufficient to point out that much of the information on these topics derives
from careful psychometric studies and much of what the counselor does is 
based either on his interpretation of these researches or on his unsupported 
and often unverbalized beliefs about the modifiability of human behavior. 
If, for example, it is believed that hard work is the primary requirement 
for mastery of any curriculum or any vocation, diagnosis will be aimed at 
assessing willingness to work hard, and therapy will be aimed at maximiz- 
ing this willingness. If, on the other hand, it is believed that hierarchies 
and patterns of abilities are relatively fixed and predictive by the time of 
late adolescence, diagnosis will be aimed at identifying the pattern most 
predictive of success, and counseling will consist in part of coordinating 
this pattern with the student's claimed choices and with the relevant cur- 
ricular offerings in the institution. 

As another example, consider therapy aimed at reducing the disabling 
effects of a severe emotional state. Such therapy must ultimately take into 
account the origin and degree of modifiability of the personality structure 
of the student showing emotional conflict. Therapy is inescapably affected 
by these factors. 

In the more generalized sense, the use of measurement in counseling 
permits inferences regarding the differential growth points, areas of mod- 
ifiability, and pattern of organization of the individual student's behavioral 
elements. Insofar as measurement has contributed to an understanding of 
this fundamental problem in psychology generally, measurement contributes 
to its understanding in the case of the individual student. 

In the second place, measurement has permitted advances in knowledge 
of the organization of mental life. Operationally, the relatively low in- 
tercorrelations at the college level between measures of ability, aptitude, 
personality, and interests indicate that these may be viewed as distinct 
components which must be separately assessed. Within a behavioral element, 
further analytic breakdowns are possible, illustrated by factor analysis 


studies of ability tests, personality tests, and interest tests. Furthermore, the 
same experimental designs that yield significant results in the study of 
measured behavioral components can equally well lead to clearer definitions 
of existing criteria of adjustment in complex life situations associated with 
educational, vocational, or personal success. 

Again, the counselor utilizes this basic knowledge of the organization of 
mental life in helping the student arrive at workable solutions to his adjust- 
ment problems. The issue here is one of choice of tests and other appraisal 
methods that will provide the clearest and most incisive picture of the sig- 
nificant behavior in the area of ability, achievement, aptitude, personality, 
or interests. 

Human learning is the third fundamental research problem in human 
adjustment. Progress in construction of achievement tests is related to prog- 
ress in our understanding of learning. Comparing and contrasting grades 
and standard achievement test scores as separate indices of learning will 
provide the counselor and student with significant insights into the educa- 
tional progress of the individual. The conditions of effective learning, the 
relation of abilities to amount of learning, the problems of emotional 
learning and re-education, the motivation toward learning — these are of as 
much concern to the counselor as to the theoretical psychologist or the 
educator. In higher education, where learning is especially valued in the 
institutional structure, many student problems are related to the creation 
of conditions for effective learning. This situation obligates the counselor 
to be able to render assistance when and as needed by the students who 
seek counseling assistance. In this task the use and interpretation of achieve- 
ment tests are sound and necessary. 

The dynamics of personality is the fourth research problem in human 
adjustment in which measurement has played a role. In today's discussion 
of objective versus projective, or structured versus unstructured tests, the 
underlying psychometric properties of reliability, validity, economy, and 
normative reference still obtain, even though the dynamic emphasis is 
causing at least a critical review of these properties and their statistical 
demonstrations. In his day-to-day clinical work, the counselor constantly is 
concerned with working examples of the research problem. The under- 
achieving student, the student persevering in an occupational choice in spite 
of failure, the overachieving student, the student afraid to recite in class, 
the social isolate, or the rejected student, each may pose a problem of 
personality dynamics to which measurement devices afford clues, insights, 
or diagnostic definitions that are ultimately helpful in effective therapy.

Essentially, these brief statements of fundamental research aspects of 
human adjustment, and the psychometric contributions to their solution, tell 


something of the equipment the counselor brings to his task. They represent 
in part the kinds of information which sooner or later must be supplied to 
the student in need of help so that workable solutions to recognized adjust- 
ment problems may emerge. 

The counseling interview is the primary vehicle by which this knowledge 
is organized and transmitted to the student. The counseling interview is 
also the vehicle in which much of the therapeutic endeavor is contained. 
While the primary purpose of this chapter is not to discuss therapy and 
evaluation of counseling, it is essential to touch upon these topics briefly 
insofar as they relate to the use of measurement in counseling. 

Implications for Counseling Practice 

A student faced with a particular problem ordinarily must choose a 
means of attack that will provide some satisfactory solution. For example, 
in moving from high school to college, the problem of new study habits 
may occur. First, the student must recognize this new adjustment demand 
clearly; then he may attempt to build new habits of study to meet the prob- 
lem. Or when he has to make a curricular and vocational choice, he may 
ask his friends, follow a family suggestion, estimate his own abilities and 
interests, and ultimately arrive at a decision. He may seek medical attention 
for excessive fatigue, weight loss, or eye strain. In each instance, the two 
processes of identifying the problem and seeking treatment for the problem 
are characteristic. When the student comes to the counselor and makes his 
statement of the problem, two people become directly involved in the 
processes of identification and treatment. The counselor brings to the task 
not only the measurement skills and fundamental knowledge described 
earlier, but also some experience in diagnosis and treatment methods. Rela- 
tively little is known about the outcomes of specific treatments; the literature 
is more informative regarding various diagnostic categories. The relation 
between diagnosis and specific treatment is poorly defined also, but appears 
to be a significant field for research. 

A few examples will illustrate this relationship. When the client is facing 
a difficult but straightforward problem of vocational choice or planning, 
he may reach a wiser decision by considering the results of psychological 
tests and data on previous development and achievement that provide some 
basis for assessing his potentialities in terms of the requirements of an 
occupation, the interests and characteristics of those in the occupation, and 
other related information. At the same time, the tests may furnish informa- 
tion regarding limitations of ability, unexpected assets, and inappropriate 
interests or attitudes, all of which are relevant to counseling. Such informa- 
tion can then be used by the counselor to assist the client to develop insights 


into his own pattern of psychological assets and liabilities as they relate to 
various occupational and other environmental demands. 

It is not to be inferred that possession of clinical data which are yielded 
by measurement must be followed by prescriptive counseling procedures, in 
which the course of action which seems wisest to the counselor is mapped 
out, and the client persuaded to accept it and act upon it. It seems obvious, 
however, that in the realm of educational and vocational counseling what- 
ever accurate information the counselor can provide the client will be help- 
ful in arriving at a basis for projected action. Skill in counseling involves 
ability to transmit such information in ways which facilitate its acceptance 
and use by the counselee. 

Admittedly, there are certain client problems which involve primarily a 
conflict of motives, ambivalence of feelings, or repression of emotional 
tendencies, which can be greatly aided by wise counseling in which a meas- 
urement approach has minimum value from the point of view of therapy. 
But before this type of therapy is chosen, the application of psychological 
tests and the review of other clinical data may validly define the area of 
the client's problem. 

Counselors must often make the decision regarding when clients should 
be referred for psychiatric consultation. When tests of neurotic tendencies 
and the judgments made by counselors fall in line, it is possible for the 
counselor to proceed with much more assurance regarding the use of re- 
ferral in case work. When measurement results are at variance with inter- 
view findings, it is usually possible to draw somewhat more extensively on 
clinical and development data to make a more positive verification. 

In the making of diagnoses, counseling problems may be categorized 
according to various frames of reference. Such categories may have the 
weakness of pigeonholing clients, but they have the very real value of 
providing a basis for possible validation studies and studies of the relation 
between diagnosis and therapy. For example, Bordin (4) has suggested 
that a sociological system of diagnostic classification be replaced by a more 
purely psychological one. Such an approach is particularly useful to the 
counselor and student with problems which can be worked out wholly 
through the counseling relationship. Sociological or situational categories 
may also serve as the basis for diagnosis, as have been described by Wil- 
liamson and Darley (27) and by Williamson (23). Bordin (4) has 
pointed out that in each counseling situation the counselor is faced with a 
choice of treatment, and discusses this problem in relation to the diagnostic 
categories used.

Problems which come to the attention of counselors may be those related
to the difficulty or inability of the student to cope with a restricted range

of problems in his social environment, or they may be problems represent- 
ing blocks in the way of student growth and development. Accurate diag- 
nosis should include an understanding, not only of the area of difficulty 
and the indications of maladjustment (for example, failing grades, unem- 
ployment, inability to concentrate), but also of dynamic factors related to 
the maladjustment (for example, inferior scholastic aptitude, lack of skill 
or occupational aptitude, repressed sexual strivings, repressed hostilities). 
It is only through a knowledge of such lacks, conflicts, or inappropriate 
attitudes that a valid diagnosis can be reached. If the counselor is to assist 
the client to develop an awareness and acceptance of these dynamic factors, 
with the expectation that appropriate steps on the part of the client will fol- 
low such insights, measurement must aid in the more objective determina- 
tion of these factors in order to provide a basis for therapeutic approaches 
and for research in therapy. 

Measurement in Evaluation of the Outcomes of Counseling 

During the past two decades there have been numerous attempts at 
evaluating the results of counseling. A general review of methodological 
approaches which might be used has been made by Williamson and Bordin 
(25; see also 26), who suggest utilizing the criterion of clinical judgment
of adjustment as a basis for evaluating the results of counseling. In numer- 
ous earlier studies, college counseling programs have used grade averages in 
assessing the outcomes of counseling, on the assumption that students who 
have been well counseled are likely to select programs which are better 
adapted to their abilities, and in which they will be better motivated; the 
higher level of adjustment would reflect itself in better work in the curricu- 
lum. This is a reasonable assumption, but suffers in its application from the 
fact that grades in high school and college are quite unreliable indicators of 
the values which the students may have received from training. 

Possibly a more direct, scientific, and reliable approach to the problem of 
validating counseling is through psychological measurement or appraisal, 
both before and after counseling, coordinated with the use of appropriate 
control groups. Such measurement might well include not only tests of edu- 
cational achievement, but also measures of social and emotional adjust- 
ment, and of specific social attitudes. This approach has been urged by 
Williamson and Bordin (24; see also 26) in the consideration of problems
attendant upon the evaluation of individual counseling programs which par- 
ticularly emphasize educational and vocational guidance and planning; and 
also by Rogers (19) as a means for checking outcomes of "client-centered" 
counseling in which a nonmeasurement approach is utilized in dealing with 
problems of a personal nature. Rogers feels that tests should not be used in 


therapy, holding that the beneficent effects of counseling are derived from
self-insights gained by the client independent of information or interpreta-
tions from the counselor, who serves primarily in the role of stimulating 
and reflecting the client's expression of feelings, and as one on whom such 
insights can be "tested." 

Most studies evaluating the outcomes of client-centered counseling have 
made use of methods which attempt to quantify interview data, with the 
inevitable difficulty that a valid criterion must be set aside, while the nature 
of the counseling process, rather than its effectiveness, is revealed. The 
investigation of Snyder (21; see also 20) appears promising, however, in
that client responses are shown to be possible of categorization in such a 
way as to yield a criterion, the validity of which few counselors would 
seriously question. Combs (6; see also 7) has demonstrated in an explora- 
tory study the use of a personality questionnaire as a check on the client's 
progress toward adjustment; in this study, as should be true in all applica- 
tions of measurement to counseling work, the principle that tests are cross- 
sectional evaluations which can be used to chart progress is upheld. 

It seems likely that in the long run some adaptation of the test-retest 
experimental design, using control groups and, if possible, discrete diag- 
nostic categories, will appear more frequently as a method of evaluating the 
outcomes of counseling and of the different therapeutic possibilities sub- 
sumed under the general heading of counseling. 
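The test-retest design with a control group reduces, at its simplest, to comparing mean pre-to-post changes in a counseled group against those in an uncounseled one. The adjustment-inventory scores below are invented; a serious study would of course add matching and significance tests.

```python
import statistics

# Invented adjustment-inventory scores before and after the counseling
# period; higher scores indicate better adjustment.
counseled_pre  = [42, 38, 45, 40, 36, 44]
counseled_post = [48, 45, 50, 47, 41, 49]
control_pre    = [41, 39, 44, 37, 43, 40]
control_post   = [42, 40, 43, 38, 44, 41]

def mean_gain(pre, post):
    """Average test-retest change for one group."""
    return statistics.mean(p2 - p1 for p1, p2 in zip(pre, post))

counseled_gain = mean_gain(counseled_pre, counseled_post)
control_gain = mean_gain(control_pre, control_post)

# The control group's gain estimates retest effects and ordinary
# maturation; the difference is the gain naively attributable to counseling.
effect = counseled_gain - control_gain
```

The point of the control group is visible in the arithmetic: whatever part of the counseled group's improvement also appears in the control group cannot be credited to counseling.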

Selected References 

1. Allport, Gordon W. Personality: A Psychological Interpretation. New York: Henry 
Holt & Co., 1937. 

2. Berdie, Ralph F. "Judgments in Counseling," Educational and Psychological Meas- 
urement, 4: 35-55, 1944. 

3. Bixler, Ray H., and Bixler, Virginia H. "Test Interpretation in Vocational Counsel-
ing," Educational and Psychological Measurement, 6: 145-55, 1946. 

4. Bordin, E. S. "Diagnosis in Counseling and Psychotherapy," Educational and Psy-
chological Measurement, 6: 169-84, 1946. 

5. Bordin, Edward S., and Bixler, Ray H. "Test Selection: A Process of Counseling,"
Educational and Psychological Measurement, 6: 361-74, 1946. 

6. Combs, Arthur W. "Follow-up of a Counseling Case Treated by the Non-Directive 
Method," Journal of Clinical Psychology, 1: 147-54, April 1945. 

7. Cowen, Emory L., and Combs, Arthur W. "Follow-up Study of Thirty-two Cases
Treated by Non-Directive Psychotherapy," Journal of Abnormal and Social Psychology,
45: 232-58, April 1950. 

8. Darley, John G. Clinical Aspects and Interpretation of the Strong Vocational In- 
terest Blank. New York: Psychological Corporation, 1941. 

9. ———. Testing and Counseling in the High School Guidance Program. Chicago:
Science Research Associates, 1943.

10. Dysinger, W. S. "Test Reports in Educational Counseling," Educational and Psycho- 
logical Measurement, 3: 361-65, 1943. 

11. ———. "Two Vocational Diagnoses Compared," Occupations, 22: 304-8, 1944.

12. Erickson, Clifford E. (ed.) A Basic Text for Guidance Workers. New York:
Prentice-Hall, 1947. 


13. Froelich, Clifford P., and Benson, Arthur L. Guidance Testing. Chicago: Sci- 
ence Research Associates, 1948. 

14. Kitson, H. D. "Aptitude Testing: Its Contribution to Vocational Guidance," Occu-
pations, 12: 60-64, 1934. 

15. Muench, George A. Evaluation of Non-Directive Psychotherapy by Means of the
Rorschach and Other Indices. ("Applied Psychology Monographs," No. 13.) Stan-
ford, Calif.: Stanford University Press, 1947. 163 pp. 

16. Paterson, Donald G.; Schneidler, Gwendolen G.; and Williamson, Edmund 
G. Student Guidance Techniques. New York: McGraw-Hill Book Co., 1938. 

17. Pepinsky, Harold B. The Selection and Use of Diagnostic Categories in Clinical 
Counseling. ("Applied Psychology Monographs," No. 15.) Stanford, Calif.: Stanford
University Press, 1948. 140 pp.

18. Rogers, Carl R. Counseling and Psychotherapy. Boston: Houghton Mifflin Co., 1942. 
450 pp. 

19. ———. "Psychometric Tests and Client-Centered Counseling," Educational and
Psychological Measurement, 6: 139-44, 1946.

20. Snyder, William U. "A Comparison of One Unsuccessful with Four Successful
Non-Directively Counseled Cases," Journal of Consulting Psychology, 11: 38-42, Jan.- 
Feb. 1947. 

21. ———. "An Investigation of the Nature of Non-Directive Psychotherapy," Journal of
General Psychology, 33: 193-224, April 1945.

22. Stead, William H.; Shartle, Carroll L.; and Others. Occupational Counseling
Techniques. New York: American Book Co., 1940. 

23. Williamson, E. G. How To Counsel Students. New York: McGraw-Hill Book Co., 
1939. 562 pp. 

24. Williamson, E. G., and Bordin, E. S. "Evaluating Counseling by Means of a Control 
Group Experiment," School and Society, 52: 434-40, Nov. 2, 1940. 

25. ———. "Evaluation of Vocational and Educational Counseling: A Critique of the
Methodology of Experiments," Educational and Psychological Measurement, 1:
5-24, Jan. 1941.

26. ———. "A Statistical Evaluation of Clinical Counseling," Educational and Psycho-
logical Measurement, 1: 117-32, April 1941. 

27. Williamson, E. G., and Darley, J. G. Student Personnel Work. New York: Mc- 
Graw-Hill Book Co., 1937. 313 pp. 

4. The Functions of Measurement in 
Educational Placement 

By Henry Chauncey and Norman Frederiksen 

Educational Testing Service 

Collaborators: Benjamin Bloom, University of Chicago; A. B. Craw- 
ford, Yale University; Henry S. Dyer, Harvard University; John M. 
Stalnaker, Association of American Medical Colleges 

The increasing importance of measurement in educational 
placement is attested by the rapid growth of established testing organizations 
and the founding of new ones, and by the number of local and regional 
testing programs which are in regular operation. The annual printing of the 
American Council on Education Psychological Examination alone runs to 
several hundred thousand copies. The Cooperative tests, measuring achieve- 
ment in a wide variety of subjects, are being employed extensively by 
schools and colleges throughout the country. State-wide testing programs, 
such as those of Minnesota, Iowa, Ohio, and Wisconsin, are assuming ever 
greater significance in guidance and placement, and in the evaluation of 
educational programs. 

What has caused this rapid growth? It results in part from improved 
procedures of test handling and from the development of mechanical- 
scoring devices. The adoption of "objective," or restricted-answer, exami- 
nations made it possible to reduce test scoring to a clerical operation, thus 
considerably lowering costs and increasing the reliability of scoring. The 
invention of scoring stencils and test-scoring machines has further reduced 
scoring time and costs, while the employment of Hollerith equipment has 
facilitated the rapid reporting of test scores and the analysis of data on a 
scale which would otherwise be well-nigh impossible. 

But there is a more basic reason for the new importance of testing: educa- 
tors have rapidly come to realize that tests offer unique help in many prob- 
lems of counseling, guidance, and, particularly, of placement. This chapter 
deals with measurement as applied to educational placement. More specifi- 
cally, it concerns itself with the use of tests in (1) admission to colleges and
to professional schools, (2) articulation, and (3) classification. As the sub- 
ject of testing in guidance has been covered adequately in the preceding 
chapter, it will be mentioned here only in brief. 



Use of Measurement in Admissions Procedures 

College Admissions 

Some state-supported colleges and universities are required by law to 
accept all students who have been graduated from an accredited high school 
in that state. Mere graduation from high school, however, is insufficient 
evidence of ability to cope satisfactorily with college work, particularly since 
the standards of graduation vary from school to school. In such a situation 
the college must either lower its academic standards or each year fail a con- 
siderable number of students. Both these procedures, as well as a compro- 
mise solution, have been used in dealing with the problem. At some univer- 
sities the problem has been met by setting up special courses, sometimes 
vocational in nature, for those students whose admission is required by law 
but who are unable to do "college-level" work. 

The procedure of eliminating a large proportion of students on the basis 
of the grades in the first term or half-term, that is, of tryout selection, has 
obvious advantages. The validity of the procedure is unquestionable insofar 
as course grades are valid and reliable; correlational studies show a higher 
relationship between first-term grades and measures of later academic suc- 
cess in college than is usually found between selection test scores and col- 
lege grades. For example, in a recent study at Princeton a correlation of .78 
was found between first-term average and average grade at graduation. On 
the other hand, the objections to the tryout are not only that it is inefficient
and costly, but, even more important, that it may cause a serious sense of 
failure and frustration in many students. 

A university which has a large number of applicants, which is not re-
quired by law to admit those students whom it considers substandard, and 
which wishes to preserve high standards of achievement, is faced with the 
necessity of devising methods of predicting college success in order to screen 
out those candidates least likely to succeed. During a period when young 
men and women in ever-growing numbers are clamoring for admission to 
college, this problem of selection becomes crucial. College admissions offi- 
cers are turning more and more to objective tests and other quantitative data 
for aid in its solution.

In those universities with a selective admission policy, an attempt is usual- 
ly made to admit only those applicants who are likely to achieve academic 
success. In addition, there is usually the implication that admission prefer- 
ence is given to those students who possess personal qualities which are 
likely to make them effective members of the social community. 

With the growth of junior colleges, an increasing proportion of students 
are entering universities at the third-year level. More students are currently
being admitted to the University of California as juniors than as fresh-
men. The problem of admission at the junior level has, however, until 
recently received relatively little attention. 

High school data 

The applicant is almost always required to present some evidence regard- 
ing his school record. Occasionally the diploma, or evidence that the 
diploma has been granted, is all that is required. More often, a complete 
transcript of the school record must be submitted. The use of the transcript 
varies considerably. The admissions officer may consider it, like the diploma, 
merely as evidence of graduation. Or, more often, he employs a consider- 
able amount of the data thereon in evaluating the candidate's academic 
qualifications, or in checking whether special entrance requirements, such 
as a specified number of units of mathematics, have been met. 

The individual subject grades on the transcript may also be used in a 
more direct way to estimate the applicant's academic promise either for col- 
lege work in general or for work in specific fields. A generally high record 
in secondary school is likely to be associated with successful performance in 
any sort of college program. More specifically, evidence of successful com- 
pletion of courses in mathematics and science, for example, may be con- 
sidered as a fair indication that an applicant can handle the program of an 
engineering school. 

Two other types of school data are often used — average school grade and 
rank-in-class. In general, predictions of college achievement from high 
school achievement have been found to be fairly accurate; but predictions 
based on average grade are inferior to predictions from rank-in-class. The 
reason is that variations occur in standards and in marking systems from 
school to school, which are uncorrelated either with the ability level of the
students or with subsequent measures of success in college. 

Rank-in-class is more predictive than average grade because it eliminates 
some of the variability due to differences in grading practices. It can be used 
in raw form (for example, 23/181 or twenty-third in a class of 181) or it 
can be reduced to percentile rank by computing the ratio of standing in class 
to size of class. If the rank-in-class data are to be used as terminal statistics 
in the subjective evaluation of applicants, the percentile rank is more con- 
venient and meaningful than the "raw" rank since it permits a direct com- 
parison of students from classes of different size. That is, it assumes that the 
student who stands at the 90th percentile in a class of 1,000 is approximate- 
ly equal in achievement to the student who stands at the 90th percentile in a
class of 50. If, however, the rank-in-class information is to be used statis- 
tically, either in correlation analyses or as one element in a weighted com-
posite of quantitative data, it is necessary to convert the percentile ranks to a
standard score on a sigma scale. In many situations it is desirable to report 
the rank-in-class both as a percentile rank and as a standard score. 
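In present-day notation, the two conversions described above might be sketched as follows; the function names and the half-unit midpoint adjustment are illustrative assumptions, not details from the text:

```python
from statistics import NormalDist

def percentile_rank(rank, class_size):
    # rank 1 is the top of the class; the half-unit adjustment places each
    # student at the midpoint of his rank interval (an assumption)
    return 100 * (class_size - rank + 0.5) / class_size

def sigma_score(pct):
    # convert a percentile rank to a standard score on a sigma (z) scale,
    # assuming a normal distribution of achievement
    return NormalDist().inv_cdf(pct / 100)

pct = percentile_rank(23, 181)   # the text's example: 23/181
z = sigma_score(pct)
```

Thus the twenty-third student in a class of 181 stands at about the 88th percentile, or roughly 1.15 sigma above the mean.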

Although rank-in-class, however expressed, overcomes the error due to 
variation in grading practices, it is still susceptible to errors arising from 
differences among schools in the average quality of instruction and the 
average caliber of the students taught. A high-ranking student in a school 
whose pupils have an average IQ of 125 is likely to be more able than a 
high-ranking student in a school whose pupils have an average IQ of 95. 
Even more serious, however, may be the errors which result from a lack of 
uniformity in the procedures used to determine rank-in-class. School A may 
base the rank-in-class on all students in the school, regardless of the curricu- 
lum which they are studying; school B may base the rank only on students 
in the college preparatory curriculum; school C may compute the rank from 
the average of the grades obtained in academic subjects alone; while school 
D may compute it from the averages of the grades obtained in all subjects, 
including art, shopwork, and physical education. Still another source of 
error arises from the fact that different schools may use different methods 
for reporting the rank of students who are tied for the same position. In 
spite of all these difficulties, however, rank-in-class is usually the best single 
predictive index available to the college admissions officer. Correlations of 
the order of .55 are commonly found between rank-in-class and measures of 
achievement in college. 

The predictive value of rank-in-class can be improved somewhat by 
introducing corrections, based on studies at a particular college, of the 
achievement of students coming from certain schools. If students from a 
particular school are found on the average to achieve grades one group 
higher than the general average, then in the case of a student from that 
school, a corresponding amount would be added to his predicted grade. At 
Princeton and Yale, for example, studies have been made of every school 
from which students come in fairly large numbers, and corrections assigned 
on the basis of past achievement of students from each school. At Harvard 
it has been found adequate to correct only on the basis of whether the 
school was public or private. Further studies of the relative value of the 
two methods are in progress. Rank-in-class predictions of freshman grades 
which contain corrections for the specific school or type of school tend to 
correlate about .60 with actual freshman grades. This correction plan does 
not, however, help in evaluating the records of students from unfamiliar 
schools; and in familiar schools it may lead to error when the grading 
standards shift or the caliber of students changes. 
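The correction scheme just described amounts to a simple table lookup; the school names and correction values below are hypothetical:

```python
# hypothetical corrections, in predicted-grade units, derived from the
# past achievement of each school's graduates at this college
school_corrections = {
    "School A": +1.0,   # graduates average one grade group above prediction
    "School B": -0.5,
}

def corrected_prediction(predicted_grade, school):
    # unfamiliar schools receive no correction, which is exactly the
    # limitation noted above
    return predicted_grade + school_corrections.get(school, 0.0)
```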


Measures of aptitude and achievement

The prediction of academic success in college on the basis of high school 
record is fairly successful because of the similarity of the two situations 
involved. Practically all factors related to academic success in high school — 
motivation, personal adjustment, study methods, and aptitude — are also 
operative in college. Prediction on this basis is still far from perfect, how- 
ever; motivation and quality of adjustment may change, study methods 
and aptitude which are adequate for high school work may be inadequate 
for college work, and study techniques may be improved. It is desirable to 
assess these various factors independently in order to have a more adequate 
basis than school record alone for admission to college and for individual 
guidance of the student after he has been admitted. Prediction can fre- 
quently be improved somewhat by including with rank-in-class in a 
multiple regression equation independent measures of aptitude, achieve- 
ment, or proficiency in tool skills such as reading and writing. 

It is impossible to make hard and fast distinctions among tests described 
as tests of scholastic aptitude, tests of achievement, and tests of proficiency 
in tool skills. For improving prediction of success in college, all are used 
for about the same purpose, and there is considerable overlapping in con- 
tent. A test involving reading comprehension might be published under 
the title of "reading test" or "scholastic aptitude test" with equal justifica- 
tion. Therefore, in this discussion no rigorous attempt will be made to seg- 
regate tests of the various types. 

As implied above, most tests used as measures of general scholastic apti- 
tude contain a variety of materials, generally including some verbal and 
some mathematical items, and possibly items relating to tool skills such 
as reading comprehension. The justification for calling any test one of 
aptitude rather than of achievement is usually that the specific items are not 
definitely related to the content of specific high school courses, although 
the claim cannot be safely made that the scores on such tests are unaffected 
by school training. An example of such a test, which is widely used, is the 
American Council on Education Psychological Examination for College 
Freshmen. This test yields two scores, a Q (quantitative) score and an L 
(linguistic) score, as well as a total score based on all the subtests. The 
Scholastic Aptitude Test of the College Entrance Examination Board also 
yields two scores, designated as V (verbal) and M (mathematical). Un- 
like most tests of aptitude and achievement, which are sold to any qualified 
examiner for his own purposes, the SAT is administered solely by the 
College Board at its own examination centers, to upward of 70,000 
candidates for admission to colleges each year. The Ohio State University
Psychological Test, prepared for the Ohio College Association, differs from
the tests mentioned above in that it is entirely verbal and is administered 
without a time limit. 

Many studies have been made of the validity of such tests of scholastic 
aptitude, and it would be possible to cite page after page of validity co- 
efficients. These coefficients vary from zero to as high as .70. This vari- 
ability of results is due to many factors, such as the range of talent rep- 
resented in the group studied, the nature and reliability of the criterion, 
and the suitability of the particular test to the specific situation studied. 


Table 2
Validity Coefficients of the C.E.E.B. Scholastic Aptitude
Test for Harvard Students
(Correlations of Verbal and Mathematical scores with freshman grades,
for public school and private school entering groups; the individual
coefficients are not legible in this copy.)


It is probably fair to say, however, that the median correlation of tests 
such as those mentioned above, using freshman grades as the criterion, 
would fall somewhere near .45. 

Some recent validity coefficients reported by Harvard on the College 
Board's Scholastic Aptitude Test for several entering groups are shown 
in Table 2. 

In a similar study at Princeton, based on 476 liberal arts students in a 
recent freshman class, the correlations of the verbal and mathematical 
scores with first-term freshman average grade were .51 and .30 respectively. 
For freshman engineering students, the analogous validity coefficients were 
.42 and .44. 

The scholastic aptitude type of test discussed above is ordinarily used 
to predict over-all success in college. However, in those cases where the 
test yields both a linguistic and a quantitative score, some differential pre- 
diction may be possible if the two scores are treated as separate observa- 
tions of the applicant. Normally, a linguistic (or verbal) score is more 
highly predictive of success in the social sciences and the humanities than 
of success in the physical sciences, and the reverse is true of a quantitative 
score. Where the situation requires more refined differentiation, tests of
achievement in particular subject fields may be needed in addition to the
aptitude tests. Such achievement tests may, indeed, serve as important 
supplements to measures of scholastic aptitude, since they operate on the 
principle that a reliable and unbiased measure of past performance in a 
given area provides one of the best means for predicting future perform- 
ance in the same area. 

The Cooperative Test Division of the Educational Testing Service offers 
a variety of such tests, including not only tests of achievement in specific 
subject fields at various educational levels, but also tests designed to meas- 
ure proficiency in the mechanics of writing, reading comprehension, gen- 
eral proficiency in mathematics and foreign languages, and general cultural 
background. Still other measures of differential aptitude are represented 
by the Psychological Corporation's Differential Aptitude Tests, the Guil- 
ford-Zimmerman Aptitude Survey, the Chicago Tests of Primary Mental 
Abilities, and the Yale Scholastic Aptitude and Achievement Tests. As a 
part of its regular entrance examination series, the College Entrance 
Examination Board also prepares achievement tests in English, social 
studies, biology, chemistry, physics, mathematics, and the foreign lan- 
guages. From the ten tests available, the candidate, on the advice of his 
school or prospective college, may select three. 

A number of regional testing programs, usually state-wide, likewise 
provide the college admissions officer with a wealth of useful data. The 
New York State Regents' Examinations, for example, provide a method 
of assessing the achievement of any graduate of a high school in the state, 
while the Fall Testing Program for Iowa High Schools furnishes scores 
on a set of general educational development tests. These tests, administered 
throughout the four years of high school and in the college freshman year, 
are planned to yield not only measures of proficiency at each particular 
year, but indications of relative growth or improvement over a period of 
years as well. 

It is difficult to summarize briefly the vast amount of data dealing with 
correlations between achievement tests and various measures of college 
success. Considerable variation in predictive value is found, because of such 
factors as the appropriateness of the test, the range of talent in the group 
studied, and the method of measuring college success. From a battery of 
achievement measures we get about as good a prediction of freshman 
average or other general measure of college success as we would get from 
rank in high school class; a validity coefficient of about .55 would be 
typical. Harvard follows the practice of averaging the College Board 
achievement test scores of its candidates for use in a multiple regression
equation. The zero-order correlations of these averaged scores with fresh-
man grades for several entering groups are shown in Table 3. 


Table 3
Correlations of Averaged C.E.E.B. Achievement Test
Scores with Harvard Freshman Grades

Type of Student        N      r
Public school         515    .55
Public school         111    .55
Public school         119    .56
Private school        507    .54
Private school        120    .60
Private school        170    .54

It should be noted here that the prediction of success in a particular
course from the appropriate achievement test may be somewhat better than
the general prediction; this is particularly true in physical science and
mathematics. In these fields it is not unusual to obtain correlations in the
neighborhood of .60. Some correlations found at Harvard and Yale are
cited in Table 4.


Table 4
Correlations of C.E.E.B. Subject-Field Achievement Test Scores
with Harvard and Yale Freshman Grades
(C.E.E.B. Mathematics, Chemistry, and Physics scores against grades in
freshman mathematics, chemistry, and physics; only one coefficient, .59,
is legible in this copy.)


Those working in the field of predicting scholastic success in college 
have felt that there are definite limitations to the use of scholastic aptitude 
and achievement tests. Those who work under conditions as nearly ideal as
can be expected have estimated that the highest potential predictive value of
such tests is represented by a coefficient of around .75. And, in fact, even
when the best of present achievement and aptitude tests, whose reliability is 
known to be high, are combined to predict grades, it is seldom possible con- 
sistently to attain validity coefficients of more than .70. 
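The arithmetic behind these ceilings is worth making explicit: the proportion of criterion variance accounted for is the square of the validity coefficient, so even a coefficient of .70 leaves about half of the variance unpredicted.

```python
# proportion of variance in college grades accounted for by predictors
# with the validity coefficients cited in the text
for r in (0.55, 0.70, 0.75):
    explained = r ** 2
    print(f"r = {r:.2f}: {explained:.0%} explained, {1 - explained:.0%} unexplained")
```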

Assessment of personal qualities 

From the foregoing it is apparent that even when the most valid meas- 
ures of aptitude and achievement are used there still remains an unpre- 
dicted variance in average college grades which amounts to approximately 
one-half of the total variance. It seems probable that this unpredicted por- 
tion is due largely to such factors as persistence, motivation, personal ad-
justment, interest, and study methods — factors difficult to quantify and
measure. Probably little further improvement in the prediction of college 
success can be expected until reasonably valid and reliable measures of 
such personal qualities have been devised. Such measures are needed not 
only to improve the prediction of academic success, but also to identify the 
desirable campus citizen, the individual who can succeed with his fellows 
as well as in his courses. Perhaps no group of persons is more acutely aware 
of this need than the admissions officers themselves, and their attempts to 
meet it have been usually earnest although largely unsuccessful. 

The application blank, which frequently contains certain items presumed 
to yield some insight into personal qualities, is usually thought to be of 
some value. The applicant may be required to answer questions dealing with 
extracurricular activities, elective offices he has held in school, educational 
and vocational objectives, war service, age, and the like. He may even be 
asked to write a brief autobiography, or a paragraph on why he chose a 
particular college, or on why he wants to go to college. However, very 
little work has been done in validating the items of information on the 
application blank. Presumably a biographical questionnaire could be pre- 
pared which would have some validity for predicting college success; the 
ordinary application blank probably contributes little, if anything, to such 
a prediction. 

Often required as part of the application for admission are letters of 
recommendation. Such letters are potentially of value, but their very source 
limits their usefulness. If the applicant is required to provide letters of 
recommendation, he will undoubtedly apply to friends or acquaintances 
who will write as glowing accounts of his abilities as possible. Such letters 
thus have limited value. Letters requested specifically by the university 
from the school principal are likely to be more useful, although much 
depends on the person writing the letter. The principal may wish to place 
as many of his students as possible, and so write uniformly favorable recom- 
mendations. On the other hand, he may try conscientiously to write an 
honest account of each candidate's qualifications. The experienced admis- 
sions officer eventually will learn to rely on letters from certain principals 
and discount recommendations from others. The ordinary letter of recom- 
mendation also suffers from the defect of lack of uniformity and systematic 
coverage of various aspects of ability, achievement, motivation, personal 
and social adjustment, and the like. In many cases the principal may not 
be qualified, because of lack of personal knowledge of the candidate, to 
make such evaluations. Furthermore, the principal's knowledge of the 
relative excellence of candidates tends to be restricted to the range of 
talent with which he normally deals. An applicant who appears to him
outstanding in comparison with other students in his school may be
mediocre or worse when compared with all the candidates competing for 
admission to a particular college. Little systematic study has been made of 
the predictive value of letters of recommendation. 

Rating scales have been developed for use at some institutions in stand- 
ardizing the types of information ordinarily given in letters of recom- 
mendation and in standardizing the method of reporting the characteris- 
tics of the applicant. It has been found in practice that the ratings usually 
seem to reflect high school achievement rather than individual estimates 
of personality characteristics. 

Another device frequently used in assessing personal qualities is the 
interview. The typical interview is unstandardized and of doubtful worth, 
however, unless its objective is merely to assure that the candidate does not 
possess obvious defects in appearance, speech, or physique. In connection 
with college admissions the interview is usually a somewhat hurried and 
haphazard affair undertaken by an enthusiastic alumnus whose judgment 
is likely to be clouded by other than academic interests, or by college 
"representatives" who see the applicants in groups under conditions that 
do not lend themselves to frank discussion and appraisal. There is some 
reason to believe, however, that when the interviewer is trained for the 
job and is permitted to work under good conditions, usable information 
about the candidate's personal qualifications may be obtained. Recent 
studies of evaluating officer-like qualities for Army and Navy personnel 
suggest that reliable and valid judgments of observable personal qualities 
can be made from a careful interview. 

Personality schedules and inventories are rarely used in college admis- 
sion procedures. Various studies have been made dealing with the predic- 
tive value of those personality tests which yield presumptive measures of 
dominance, submissiveness, neurotic tendency, and the like. The results 
have usually indicated that such tests bear negligible relationship to col- 
lege grades, although they may be useful in other ways. Studies dealing 
with measures of academic interests and of drive or persistence have yielded 
somewhat more hopeful results, and various other attempts to measure 
personal qualities have proven to be of some worth. Preliminary studies 
with the Rorschach test, such as those by Munroe (11) at Sarah Lawrence
College, seem to indicate that further studies of projective techniques may 
prove fruitful. 

Combinations of measures 

The various kinds of data employed by the admissions officer are, of 
course, used in combination. In a typical situation, the data available for
each applicant might include an application blank, letters of recommenda-
tion, a record of impressions from an interview, a score or scores from a 
scholastic aptitude test, a transcript of the high school record, and rank 
in high school class. How are all these variables to be used in deciding 
whether or not the applicants are to be admitted? 

A procedure which may be fairly typical in many colleges is simply to 
sort the folders containing all the available information into groups, based 
on a subjective evaluation of each individual case. One pile contains the 
folders of applicants whose qualifications clearly justify admission; an- 
other contains folders for all who are clearly too poor to be admitted; and 
the third contains doubtful cases. The next step is to mull over the doubt- 
ful cases and to make additional sortings until enough folders have been 
added to the "admit" pile to meet the quota for that particular term. Some 
such procedure is necessary when the data are not quantitative; the sorting 
is a process of quantifying the data by a ranking or rating procedure. 

When part of the data are quantitative, such as rank-in-class and apti- 
tude test score, it is possible to make a definite prediction of the grade 
of each candidate on the basis of statistical studies determining the validity 
of these predictors. The validity of the combined measures and their best 
relative weighting in the predictor can be determined through techniques of 
multiple correlation. Using multiple correlations for combinations of rank- 
in-class with various aptitude and achievement test scores, taking freshman 
grades as the criterion, the coefficient is sometimes as high as .75. The task 
of making individual predictions on the basis of such studies may be 
simplified by constructing suitable tables or diagrams. Especially for border- 
line cases the admissions officer may feel that the prediction can be im- 
proved still more by subjectively evaluating interviews, applications, and 
recommendations. Very little attention has been paid to studying the validity 
of this subjective type of evaluation or the extent to which such evaluation 
can improve a prediction made originally from a regression equation. 
Introducing subjective modifications of multiple regression predictions has 
been known to weaken rather than to improve the predictions. 
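For two quantitative predictors, the multiple correlation and the relative weights can be computed directly from the three intercorrelations; the coefficients below are illustrative only, not drawn from any study cited here:

```python
from math import sqrt

def two_predictor_regression(r1c, r2c, r12):
    """Standardized regression weights and multiple R for two predictors
    (say, rank-in-class and an aptitude test score) against a criterion
    such as freshman average grade."""
    denom = 1 - r12 ** 2
    b1 = (r1c - r2c * r12) / denom
    b2 = (r2c - r1c * r12) / denom
    big_r = sqrt(b1 * r1c + b2 * r2c)
    return b1, b2, big_r

# illustrative values: each predictor correlates moderately with grades,
# and the two predictors correlate .40 with each other
b1, b2, big_r = two_predictor_regression(0.55, 0.45, 0.40)
```

The multiple R here comes to about .60, exceeding either predictor taken alone, which is the kind of gain described above.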

Evaluation of admissions procedures 

A great deal of study has been devoted to the evaluation of quantitative 
admissions data — measures of scholastic success in high school, and scores 
on scholastic aptitude tests, achievement tests, and tests of proficiency in 
tool skills. A number of universities, such as Harvard, Yale, and Minne- 
sota, have contributed many studies, and the College Entrance Examina- 
tion Board has long been engaged in studies of its own tests. The value 
and limitations of these variables have been fairly well demonstrated. One
aspect of the selection problem which has tended to be neglected, how-
ever, is evaluation of personal qualities — motivation, persistence, study 
methods, interests, and personal adjustment. While it is often possible to 
state with assurance, when a prediction failure occurs, that the student in 
question failed because of too many social distractions, or because of a lack 
of objective, or because of emotional problems, it is much more difficult 
to identify such students prior to their admission to college. 

The fact that quantitative measures of personal qualities are not readily 
available is probably the principal reason that little has been done to study
the relationships between judgments of such qualities prior to the stu- 
dent's admission and his subsequent college performance. Pending further 
experiment in the admission situation with such techniques as the con- 
trolled interview and projective tests, there are rich opportunities for the 
investigator who can persuade boards of admission to make an attempt, 
however crude, to quantify their own judgments of the personal traits of 
each candidate, as derived from the evidence ordinarily available. The 
correlation between such judgments and the ultimate success of students, 
both academically and otherwise, is now unknown. When found for any 
particular institution, it would almost certainly have an important effect 
on admission policy. 

Evaluation of the assessments of personal qualities obviously requires 
quantification. One approach might be to attempt to quantify judgments 
of personal qualities without essentially changing the nature of such judg- 
ments. For example, all that might be required would be to assign numbers 
to piles of folders, as described above, and to record the number of the 
pile for each student's folder. This might be a very crude scale involving 
only two steps in the continuum, the obviously acceptable and the doubt- 
ful (since the unacceptable students would not be admitted and could not 
be studied); or the folders might be sorted into a larger number of piles 
representing steps in a continuum. The subjective judgments based on 
over-all impressions of the data could then be evaluated, using the same 
correlational techniques ordinarily applied to test scores. Similarly, a study 
could be set up to compare predictions obtained from a regression equa- 
tion with predictions obtained after modifying these by applying subjec- 
tively evaluated personal data. 
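Evaluating such pile numbers requires nothing beyond the ordinary product-moment correlation; the pile assignments and grade averages below are invented for illustration:

```python
from statistics import mean, pstdev

def pearson(x, y):
    # ordinary product-moment correlation between two series
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

# hypothetical data: pile 2 = clearly acceptable, pile 1 = doubtful,
# paired with each admitted student's later grade average
piles  = [2, 2, 1, 2, 1, 1, 2, 1]
grades = [3.4, 3.0, 2.1, 3.6, 2.8, 2.0, 2.9, 2.5]
r = pearson(piles, grades)
```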

Perhaps a more hopeful approach would be to attempt to use techniques 
of test construction in assessing personal qualities. The possibilities in 
this area are many. By means of item analysis techniques, using nonaca- 
demic college achievement as an external criterion, a personal data blank 
might be developed which would be considerably more valid than the
ordinary application blank. Or interviewing techniques might be standard-
ized so as to yield some reliable and valid ratings of personal qualities. 
Further development of tests of persistence or drive, and a follow-up of 
preliminary studies of Rorschach or other projective tests, again using item 
analysis techniques with an external criterion, might prove profitable. 
While some improvement in prediction of college success may come from 
further refinement of the aptitude and achievement tests, it would seem 
that the greatest advances may come through a thorough exploration of 
the measurement of personal qualities. 
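The item analysis suggested here amounts to correlating each dichotomous item with the external criterion, for which the point-biserial coefficient is the standard tool; the responses and criterion scores below are invented:

```python
from math import sqrt
from statistics import mean, pstdev

def point_biserial(item, criterion):
    # item: 1 if the applicant endorsed the item, 0 otherwise;
    # criterion: e.g. a later measure of nonacademic college achievement
    ones  = [c for i, c in zip(item, criterion) if i == 1]
    zeros = [c for i, c in zip(item, criterion) if i == 0]
    p = len(ones) / len(item)
    return (mean(ones) - mean(zeros)) / pstdev(criterion) * sqrt(p * (1 - p))

# hypothetical responses to one biographical item, with criterion scores
item      = [1, 1, 0, 0, 1, 0]
criterion = [4, 3, 1, 2, 5, 2]
r_pb = point_biserial(item, criterion)
```

Items whose coefficient against the criterion proved high and stable across samples would be retained; the rest discarded.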

In any serious research program aimed at improving the prediction of 
academic success in college, another important first step is to determine the 
adequacy of the criteria of academic success. What is the reliability of a 
course grade? What is the reliability of an average of course grades? Which 
courses tend to be reliably graded and which unreliably graded? What 
does the grade in a particular course represent in the mind of the in- 
structor who gives it? Is the grade valid in terms of the announced objec- 
tives of the course? Would more than one grade for each student in a 
particular course provide a better description of his knowledge and ac- 
complishments? To none of these questions have the experts in measure- 
ment yet found satisfactory answers. On dubious evidence, the assertion 
is frequently and too easily made that course grades, by and large, are 
unreliable (hence unpredictable) (a) because they are too often based on
a series of essay examinations whose unreliability is "well known," (b)
because there is likely to be a large variation in the standards used by
different instructors in grading, and (c) because grades are often in part
determined by such "irrelevant" factors as punctuality in the submission 
of assigned work, the degree of alertness in class discussions, and the 
neatness in the preparation of papers. Such notions as these are a part of 
the folklore of educational measurement. Too little effort has been made 
to gather the kind of data which would make possible a thorough-going 
analysis of the situation in any given institution. An important contribu- 
tion can be made by the expert in measurement who begins his evaluation 
of admission procedures by first determining the essential characteristics 
of the criteria they attempt to predict. 

Admission to Special College Programs 

The problem of college admissions is complicated by the fact that there
exist various types of college programs requiring varying patterns of 
abilities. An applicant might not possess the proficiency in mathematics 
necessary for successful completion of an engineering course, but be well
qualified for a liberal arts program; or he might be sufficiently handicapped
in reading skills to interfere with success in social studies, but do well
in a fine arts course.

Successful differentiation of students where admission to special pro- 
grams must be made prior to freshman year may depend upon opportuni- 
ties for precollege counseling. The University of Nebraska, which is re- 
quired by law to admit qualified graduates of accredited high schools in the 
state, has set up special programs leading to a certificate in two years, for 
those students who are not qualified for college work at a high level. Be- 
fore registration, a battery of aptitude and achievement tests is adminis- 
tered to all students, and the test scores, together with information relating 
to high school achievement and the like, are put in the hands of counselors. 
Each applicant is then interviewed, and the attempt is made to steer him 
into the program best suited to his abilities and interests. The success of 
such a program of precollege counseling depends not only on the counselor, 
but also on the availability of good tests and dependable descriptions of 
the relationships between test data and success in the various special col- 
lege programs. The establishment in 1932 of what is now known as the 
General College at the University of Minnesota is a more familiar example 
of an attempt to differentiate among students in admissions to various 
college programs. 

As intimated in the preceding discussion of achievement tests, generally 
the best predictor of success in a particular subject-matter field, such as 
physical science or foreign language, is a lower-level achievement test 
in the same or a corresponding field. The validity coefficients in general 
compare to those usually obtained between scholastic aptitude test scores 
and general measures of college achievement. Few studies have been made 
of possible improvement of such prediction by adding variables, such as 
high school scholarship in that field, or measures of interest and motivation. 

At Harvard, however, a minor study was made to determine the degree 
to which high school rank-in-class and the C.E.E.B. mathematical aptitude 
score, when combined with the C.E.E.B. physics achievement score, tended 
to increase the multiple correlation with grades in a freshman physics 
course. For one group of 58 cases the physics score alone produced a cor- 
relation of .50; the multiple correlation was .62. When a revised physics 
test was used with a later group of 48 cases, the physics test score alone 
yielded a correlation of .66, and in combination with rank-in-class and 
mathematical aptitude score gave a multiple correlation of .69. The differ- 
ence between the two cases in the amount of improvement is worth noting. 
An optimally weighted combination of several predictors always improves 


the prediction to some extent, but whether the improvement is sufficiently 
large to warrant the extra labor of applying multiple regression techniques 
is a matter that must be decided on the basis of evidence pertaining to 
the particular problem in question. 
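
The gain reported in the Harvard figures can be checked against the standard two-predictor formula for the multiple correlation. The sketch below is illustrative only: the validity of .50 is taken from the study quoted above, but the second predictor's validity and the intercorrelation between the two predictors are assumed values, not data from the study.

```python
from math import sqrt

def multiple_r_two_predictors(r1, r2, r12):
    """Multiple correlation of a criterion with two predictors, given
    each predictor's validity (r1, r2) and their intercorrelation (r12)."""
    r_squared = (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)
    return sqrt(r_squared)

# Illustrative values: single-test validity .50 (reported above); a
# second predictor with validity .30 and intercorrelation .20 (assumed).
R = multiple_r_two_predictors(0.50, 0.30, 0.20)
print(round(R, 2))  # → 0.54, a modest gain over .50 alone
```

Whether a gain of this size repays the labor of computing regression weights is, as the text observes, an empirical question to be settled separately for each institution.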

In certain fields, however, prediction on the basis of aptitude tests has 
not been successful to the degree that it has in such fields as mathematics, 
science, and social studies. Comparatively little advance has been made in 
measuring musical talent, for example, since Seashore's Measures of Musical 
Talent were developed in 1919. A number of tests dealing with artistic 
ability have been produced, and some attempt has been made to measure 
literary discrimination and appreciation. Some of these tests have been 
subjected to considerable critical study, with varying opinions reported 
by various authors. It is probably fair to say that tests in this general area 
are of limited usefulness except when dealing with groups in which a 
wide range of ability is represented, and that at the college level they are 
not likely to have great predictive value. Another limitation in differential 
prediction is in making distinctions within a general field of ability. For 
example, we are not able at present to distinguish on the basis of aptitude 
between engineering and other physical sciences. Conversely, there may 
often be considerable variation in the kinds of talents demanded by differ- 
ent parts of a field grouped under a common label. The prospective chemist 
who decides to specialize in physical chemistry, for example, will require 
many skills and aptitudes which are not possessed by his fellow-student 
concentrating in organic chemistry. Sometimes, but not always, vocational 
interest tests may be of some help in differentiating among fields which 
all require nearly the same pattern of aptitudes. 

The problem of admitting students to schools of engineering has prob- 
ably been studied more extensively than that of admission to any other 
special college program, and the efforts in this area have perhaps been 
most successful. The types of tests most frequently used, aside from those 
of general scholastic aptitude, are mathematical achievement or aptitude, 
spatial relations, and mechanical knowledge or "ingenuity." Since mathe- 
matics is an extremely important tool skill in the engineering curriculum, 
it is not surprising to find that course grades in mathematics or examina- 
tions such as the College Board's Comprehensive Mathematics Test are 
among the best single predictors of success in engineering. 

Another aptitude required in many engineering courses, particularly in 
mechanical drawing and descriptive geometry, is the ability to think in 
terms of three-dimensional space. A number of so-called spatial relations 
tests have been developed in the attempt to measure this important ability. 
The items in such tests are of varied types. They may involve "block- 


counting," i.e., counting the blocks in pictures of irregular stacks of blocks, 
including those blocks not visible but necessary to hold up the structure. 
Another type is "intersections," in which a plane intersecting a solid 
geometrical figure is shown; the task is to choose from alternatives pre- 
sented the figure representing the shape of the resulting cross section of 
the solid. Still another type requires the student to imagine how a pictured 
solid figure would look if it were rotated 90 or 180 degrees in one or 
more directions, while other types of items require judgments on two- 
dimensional figures. Results on the best of such tests may correlate with 
grades in mechanical drawing and descriptive geometry as high as .50 to 
.60, depending upon reliability of the criterion. 
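
Since the observed validity of such a test is capped by the reliability of the criterion, the classical correction for attenuation indicates what the correlation between true scores might be. The figures below are assumed for illustration; only the observed validity of .50 echoes the range cited above.

```python
from math import sqrt

def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Estimated true-score correlation, given the observed validity
    r_xy and the reliabilities of the test and of the criterion."""
    return r_xy / sqrt(rel_x * rel_y)

# Assumed figures: observed validity .50, test reliability .90,
# criterion (drawing-grade) reliability .70.
print(round(correct_for_attenuation(0.50, 0.90, 0.70), 2))  # → 0.63
```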

There are published a number of "mechanical aptitude" tests. Most of 
these, such as Stenquist's and O'Rourke's, are heavily weighted with items 
related to knowledge of tools and parts of mechanical objects. If one sup- 
poses that high scores on such a test reflect interest in the realm of me- 
chanics, the test might be assumed to have some value in a differential 
battery. This supposition is vitiated, however, by the wide variation among 
students regarding opportunities for earlier acquaintance with practical 
mechanics. Tests of this type were rather widely used in classifying men 
for military training in World War II and probably would have some 
value in predicting success in school or college shopwork, although they 
were designed primarily for use in industrial selection rather than for 
colleges. The type of item developed by Bennett for his Mechanical Com- 
prehension Tests was also widely used in Army and Navy classification 
tests. Bennett's test, sometimes said to measure "barnyard physics," is made 
up of items which pictorially depict the operation of physical principles 
in everyday situations. These measures of mechanical aptitude or compre- 
hension have some value in selecting men for vocational training; they 
have received relatively little attention from engineering schools as aids 
in selection. 

A number of tests of so-called science or engineering aptitude have 
been developed for use at the college level. Some, such as Zyve's Stanford 
Scientific Aptitude Test, are inadequately standardized and validated, and 
probably are of limited value, although certain of the materials seem 
promising. The subsections of this test are too short to yield reliable meas- 
ures of the various aspects of aptitude and achievement which they are 
intended to measure. 

The widely used Pre-Engineering Inventory was developed by the Meas- 
urement and Guidance Project in Engineering Education, which is spon- 
sored by two engineering groups — the Engineers' Council for Professional 
Development and the American Society for Engineering Education. The 


inventory is a battery of tests designed to measure aptitude and achieve- 
ment in areas related to success in the engineering curriculum. Composed 
of six tests, it requires about five and a half hours to administer. The tests 
measure general vocabulary, technical vocabulary, comprehension of tech- 
nical passages, tables, and graphs, ability to perform and interpret quantita- 
tive operations, comprehension of mechanical principles, and ability in 
spatial visualizing. The six scores are reported graphically in the form of 
a test profile showing percentile ranks. Each of the six scores has sufficiently 
high reliability to justify the use of the individual test profile for guidance 
purposes as well as for purposes of admissions. An excellent example of 
what can be accomplished by a serious cooperative research enterprise aimed 
at solving a particular problem in admissions, the inventory has been 
found to have good predictive value and has met with approval by en- 
gineering schools generally. 

At some colleges, limited numbers of students may be admitted to 
special college programs in the upper-class years, such as the Woodrow 
Wilson School of Public and International Affairs at Princeton. In such 
a situation, probably no better method of selection could be employed than 
to use the academic record of the previous years as the basis for predicting 
scholastic success. The correlations between sophomore and junior grades 
in college are ordinarily higher than those between a specific test and 
junior grades. Little attention has been given to other qualifications that 
might be required, at least from the standpoint of measurement. 

Admission to Graduate Schools

The problem of admitting students to graduate schools is essentially 
the same as that of college admissions, except that more specific abilities 
and interests and higher levels of ability are required. The problem is 
somewhat more difficult from the standpoint of measurement, since a good 
deal of preliminary selection has already occurred. The range is often 
restricted still more by requiring that all applicants meet some fairly 
stringent standard of achievement, such as a B average in college work. 
Within the comparatively narrow range of ability that remains, it would 
be surprising indeed if ordinary tests of scholastic aptitude proved to have 
high predictive value.
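
The effect of such preliminary selection, known as restriction of range, can be illustrated by simulation. The population validity of .60 and the selection cutoff used below are arbitrary assumptions, chosen only to show the direction of the effect.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

random.seed(0)
rho = 0.6  # assumed validity in the unselected applicant population
aptitude = [random.gauss(0, 1) for _ in range(20000)]
grades = [rho * a + sqrt(1 - rho ** 2) * random.gauss(0, 1) for a in aptitude]

r_full = pearson_r(aptitude, grades)
# Admit only applicants well above the mean, mimicking a stringent
# prior-achievement screen such as a B average.
selected = [(a, g) for a, g in zip(aptitude, grades) if a > 0.5]
r_restricted = pearson_r(*zip(*selected))
print(round(r_full, 2), round(r_restricted, 2))
```

In the selected group the correlation falls well below its population value, which is why even a genuinely valid test may look unimpressive when computed only on admitted students.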

As is true at other stages in the academic program, the best predictor of 
general scholastic success is the record of academic achievement in the 
preceding scholastic program. Again, however, the accurate assessment of 
competence from grades is made difficult by variations from college to 
college in grading standards and practices, quality of instruction, and 
aptitude of students. 


It is now possible to use a battery of achievement tests, known as the 
Graduate Record Examination, to measure objectively college achievement 
in eight fields (mathematics, physics, chemistry, biological science, social 
studies, literature, fine arts, and verbal ability). The battery is suitable for 
use with college seniors or graduates. Advanced subject tests are also 
available; ordinarily a candidate would take in addition to the eight gen- 
eral tests one advanced test, probably that in his major field. Now ad- 
ministered by the Educational Testing Service, the Graduate Record 
Examination was initially prepared through the cooperation of the graduate 
faculties of Harvard, Yale, Princeton, and Columbia. The examination 
offers an evaluation of a candidate's general knowledge in various fields 
as well as his proficiency in a chosen field. Prediction of success in graduate 
work on the basis of the G.R.E. is about as good as that obtainable from 
college grades, and prediction based on both G.R.E. scores and college 
grades is superior to prediction from either one alone. G.R.E. is also valu- 
able to the increasing number of graduate schools which are interested in 
the broad cultural background of students as well as their aptitude and 
proficiency in a prospective field of specialization. 

A test for graduate students which is similar in form and purpose to the 
scholastic aptitude type of test discussed above is the C.A.V.D., which has 
long been used to screen applicants for various graduate schools in Co- 
lumbia University. The C.A.V.D. is divided into a number of levels, the 
lowest of which are suitable for use with children in kindergarten and 
the most advanced of which are appropriate for college graduates. It thus 
provides a continuous scale for the determination of general scholastic 
aptitude in both children and adults. At each level, there are four subtests, 
which give the instrument its name: 

C — Completion of sentences by supplying omitted words 
A — Arithmetic: items involving the manipulation of numbers and the solving of problems 
V — Vocabulary 
D — Directions: items which in the upper levels constitute the usual sort of reading comprehension test 

The principal difference between the C.A.V.D. and a battery like the 
G.R.E. is that the former attempts to test the student's ability to reason 
with material that is not specifically related to courses he may have studied 
in college. 

Studies of the selection of students for programs of graduate study 
have been confined almost entirely to intellectual aptitudes and abilities, 
although some attention has been paid to motor skills and interests. Com- 
paratively little attention has been paid by graduate schools to factors of 


personality and temperament as they relate to academic or professional 
success, and little has been done in the measurement of aptitude for most 
of the highly specific fields of professional specialization, such as patent 
law, cost accounting, and the various specialties in medicine. 

The Medical College Admission Test, successor to the Professional 
Aptitude Test, is sponsored by the Association of American Medical Col- 
leges and required of candidates for admission by member-colleges. It was 
first administered in 1948 and consisted at that time of four tests of gen- 
eral ability, including verbal ability on scientific, social, and humanistic 
materials, and quantitative ability; and two achievement tests in social 
studies and premedical science. Scores may also be used by approved schools 
of dentistry and pharmacy as well as by the sponsoring medical schools. 

The standards of admission to schools of dentistry are in general some- 
what lower than for medical schools, and a somewhat different pattern of 
ability is required. Skill in many common aspects of the practice of dent- 
istry seems to depend more on motor ability than on high intellectual 
aptitude. A variety of mechanical and motor ability tests have been tried 
with varying results, but the prediction of success in dentistry is generally 
less successful than that in medicine. Tests of intelligence are generally 
found to be more valid than motor tests, although the relatively low 
validity of motor tests may be partly a function of the criterion; grading 
of students on the intellectual aspects of the curriculum is likely to be 
more reliable and, hence, contribute more to final grades than do ratings 
of success in manipulating instruments. However, the possible approaches 
to the problem of measuring motor and perceptual skills have by no means 
been exhausted. 

Widely used for predicting success in law school is the Law School 
Admission Test, developed in 1947 by a number of leading schools and 
now administered to candidates four times a year. This test measures verbal 
ability, including vocabulary, and also reading comprehension, reasoning 
ability (using fictitious legal cases), and ability to discern crucial relation- 
ships. A preliminary validity study shows a correlation of .63 between 
test scores and first-year law school grades. 

Admission to "nonprofessional" graduate schools is typically handled 
on a more informal and subjective basis than admission to undergraduate 
colleges. Each department of a graduate school is usually responsible for 
admitting its own students, and the number of cases therefore tends to be 
small. The typical procedure probably consists of having the members 
of a departmental committee evaluate each candidate subjectively, on the 
basis primarily of his undergraduate transcript, letters of recommendation, 
and whatever test scores are available, then ranking the candidates in order 


of merit. These rankings are later reported at a departmental or com- 
mittee meeting and the disagreements thrashed out; through some more 
or less democratic process a final decision is reached on each candidate. 
The small number of cases usually involved tends to discourage any statisti- 
cal approach to the problem. In the larger graduate schools, however, it is 
possible to make studies designed to evaluate and eventually to improve 
admissions procedures. Such studies are being increasingly performed, and 
the results tend to justify the continued use and development of measure- 
ment techniques in admissions procedures. 

Awarding Scholarships 

For some time there has been a growing tendency to subsidize the higher 
education of qualified students, through granting of scholarships, fellow- 
ships, and other types of financial aid. The initial impetus for this tendency 
followed the First World War, when many young people from families 
of limited means sought admission to colleges and universities. With the 
onset of the depression in 1929, when financial resources were limited and 
job opportunities curtailed, many more needy students applied for higher 
education; this new need for financial aid resulted in the establishment of 
larger scholarship funds and even, in some instances, state scholarships. 
Federal aid was granted to many students, beginning in 1934, through 
the Federal Emergency Relief Administration and later through the Na- 
tional Youth Administration. Following World War II government sub- 
sidization of education has been considerably enhanced through the edu- 
cational provisions of the GI bill, and funds have been provided through 
the Public Health Service and the Veterans Administration for stipends 
or part-time work arrangements for training of graduate students in fields 
where specialists will be needed in the future. An increasing number of 
scholarships and fellowships is being awarded annually for study at both 
graduate and undergraduate levels by grants from industrial organizations. 
In short, the funds provided for financial aid to students come from 
varied sources — individual benefactors, philanthropic organizations, re- 
ligious groups, industry, the state and federal governments, and the educa- 
tional institutions themselves. The motives of those concerned in the allot- 
ment of the funds are even more varied; altruism, income tax exemptions, 
need for specialists, possibilities of obtaining control over new inventions 
or discoveries, and the desire for a winning football team have all been 
involved to a greater or lesser degree. It is not surprising, then, that there 
is also considerable variation in the procedures for administering the 
financial grant or award and in the standards required of applicants. 
This growing tendency toward the subsidizing of worth-while students 


should be considered in its implications for society. The fundamental ques- 
tion to be considered pertains to the ultimate objectives which are to be 
served by granting financial aid to students. Is it our aim to produce aca- 
demicians to carry on the humanistic traditions of liberal education? Are 
we to produce scientists to pursue research considered by our political 
leaders to be of current importance? Do we wish to train specialists to treat 
particular social problems? To groom certain young people for political 
leadership? Or are we to have no definite objective other than to give 
training consistent with each individual's aptitudes and inclinations in the 
hope that something of value will ultimately be accomplished? The more 
specific of these purposes will be criticized on the grounds that they are 
too narrow and tend to restrict unduly the social and scientific develop- 
ment of society. The last broad objective mentioned above might be con- 
sidered by some to be so indefinite as to be no policy at all and hence 
inefficient and wasteful; others may claim that any narrower objective 
makes for intellectual dictatorship. 

Whatever policy is chosen must be implemented by some method of 
selection as well as by a curriculum and by training techniques. It is with 
the selection of beneficiaries for financial aid that we are primarily con- 
cerned here. Whoever is responsible for such selection methods must face 
the problem of objectives, whether or not a policy has been determined 
for him. 

University scholarships 

In administering a scholarship program at a university, the standards 
set up are usually analogous to those required for admission except that 
they are higher. Among other qualifications, the successful candidate must 
evidence high general scholastic aptitude. This may be determined on the 
basis of the same tests and indices of school success as are used for admis- 
sion. Or special scholarship examinations may be required, which are 
usually similar in form to a scholastic aptitude test. There may also be 
considered personal requirements, such as high character and qualities of 
leadership, which are ordinarily judged subjectively on the basis of recom- 
mendations and interviews. A third type of requirement is financial need. 
This may be evaluated on the basis of reports by students, usually endorsed 
or supplemented by financial statements from parents. The student may 
be required to furnish an itemized budget showing all sources of income 
and financial resources. Improved methods of estimating the fair share of 
expenses which should be borne by the student's family are being de- 
veloped at some universities. Since scholarship awards are generally made 
from year to year, the recipient, in order to retain his scholarship, is 


usually required to maintain a specified standard of achievement in terms 
of grades. Where funds are made available by donors, certain additional 
restrictions may exist, such as geographical region, race, religion, field of 
study, school attended, or even family name. 

The awarding of university scholarships is not always an objective and 
impartial matter. And since techniques of evaluation are most subjective 
in the area of character and personality, it is here that abuses are most 
likely to occur. Alumni groups may, for example, throw strong support to 
a candidate on grounds other than those of scholastic achievement and 
aptitude, with results which are occasionally unfair to those who lack 
influential partisans. The application of carefully developed measurement 
techniques, where available, tends to put the awarding of scholarships on 
a fairer basis. The improvement of techniques for awarding scholarships 
involves essentially the same problems of measurement as have already been 
discussed in connection with college admissions. 

Methods used by other organizations sponsoring scholarships 

Increasingly scholarships are being sponsored by organizations other 
than universities — business and industrial organizations, philanthropic 
foundations, and religious groups. The usual arrangement is for the funds 
to be administered by a university or by an administrative group composed 
primarily of educators, presumably to avoid any semblance of an outside 
control of higher education. 

Likewise, the federal government has avoided control of education. A 
veteran of World War II must be admitted to a college through its regular 
admission procedures before GI funds are made available for tuition and 
subsistence, and no restrictions are set up by the government as to course 
of study. In the new NROTC program there are again no restrictions as to 
course of study beyond the hours which must be devoted to naval science, 
and each NROTC college is free to exclude any student, even though he 
has been accepted by the Navy through its own selection program. 

In recent years a number of scholarship programs sponsored by non- 
educational groups have been set up on a nation-wide basis, with awards 
based on competitive examinations. The Westinghouse Electric and Manu- 
facturing Company has, for example, established a number of scholarships 
for engineering training. Under the name of "Science Talent Search," a 
scientific aptitude examination is administered annually to a large num- 
ber of high school seniors. This aptitude test is made up of items involv- 
ing number series, vocabulary, mechanical movements, mathematical 
reasoning, and spatial relations, arranged in omnibus 
fashion, together with a section on science reading comprehension. Each 


competitor qualifying on the aptitude test submits an essay on some sci- 
entific project he has himself carried through. A small group deemed to 
have produced the best essays are then brought together for several days 
in Washington, D.C., with a group of recognized scientists, who thus have 
ample opportunity for appraising the personal qualifications of each 
candidate. The final awards are made at the close of this period on the 
basis of all the data available. The interesting feature of this method of 
selection is that it applies the so-called "successive hurdles" approach, thus 
supplying very complete information on which to judge the intellectual 
and personal qualifications of the "finalists" from among whom the win- 
ners are selected. Its principal disadvantage is one common to all selec- 
tion methods based on successive hurdles: highly desirable candidates may 
be eliminated in the early rounds of competition because of insufficient 
evidence. While the method unquestionably identifies excellent candi- 
dates, it may fail to identify all the best candidates. 

The same general method is employed by the U.S. Navy in the selec- 
tion of candidates for its college programs. Under the terms of the Hollo- 
way Plan, the U.S. Navy is admitting to the naval and naval aviation 
college program more than two thousand young men each year under con- 
ditions equivalent to generous scholarship awards. These men are admitted 
to the program through procedures set up by the Navy; they must also 
be accepted by the NROTC college or university through its regular admis- 
sion procedure. The first step in the Navy's selective process is the adminis- 
tration, at special centers throughout the country, of a selection test con- 
taining items of the scholastic aptitude type. Through this examination a 
preliminary screening is made. The candidates are previously informed 
about certain other requirements, such as freedom from certain physical 
defects, and those candidates who know of factors which would prevent 
their admission to the program are discouraged from applying. 

Those who survive the preliminary screening are directed to report to 
an Office of Naval Officer Procurement for further processing, which 
consists of a physical examination and interviews. Only those candidates 
who survive the physical examination are interviewed. Considerable care 
has been taken in designing a standardized interview situation, in the 
course of which interviewers use standard rating forms; each candidate is 
interviewed independently by two officers trained in the technique of 
interviewing. All the information — test scores, interviewers' ratings, and 
a composite score — then goes to a regional board, which determines which 
of the candidates are to be accepted. All these procedures are definitely 
oriented toward fairly specific objectives — the selection of men capable of 
successful college work and endowed with qualities which will make them 


successful officers in the United States Navy. The successful candidates 
enjoy midshipman status and its equivalent in pay while they are students. 
This whole program of selection has been planned with unusual care and 
may serve as a valuable proving ground for testing certain selection procedures. 

Scholarships and fellowships for graduate study 

The awarding of fellowships and scholarships for graduate study at 
most graduate schools is usually a part of the admissions procedure. As 
described previously, the typical methods are largely subjective evaluations 
of such materials as transcripts and letters of recommendation, culminating 
in a rank-ordering of the candidates. The best fellowship might be offered 
to the highest-ranking candidate, and so on down the list. Fellowships 
established by industrial or business organizations are usually administered 
by the graduate school in the same manner as the university's fellowships. 

The National Research Council and the Social Science Research Council 
are awarding predoctoral fellowships which they, rather than the univer- 
sity, administer; the methods of selection are generally similar to those 
employed by the universities themselves. The National Research Council 
uses, in addition to other evidence, the results of a battery of aptitude and 
special field tests. 

While the foregoing is only a brief review of methods of administering 
student scholarships and other financial aids, it may give some notion of 
the variability in objectives and techniques in current usage. The objectives 
are both specific and varied; yet we do not find a corresponding variation 
in the measurement devices used, except to some extent in the content of 
instruments for appraising aptitude. Verbal, quantitative, spatial, and 
mechanical comprehension items are used in varying proportions, depend- 
ing upon the objectives of the scholarship program. We do not, however, 
find much similar variability in appraising personal qualities. With our 
inadequate knowledge of how personality characteristics are related to 
success in different fields of academic endeavor, we cannot now say whether 
or not distinctions can be made in this area. Perhaps the desirable personal 
qualities are the same, regardless of the field of specialization; or perhaps 
research will eventually make it possible to designate differentially and 
measure crucial personal qualities. 

Use of Measurement in Articulation 

In the educational progress of a student through the elementary grades, 
high school, and college, it is, of course, extremely important that he be 
enrolled in courses which are appropriate to his level of proficiency at any 


given time — that he take courses which are neither too difficult nor which 
involve wasteful duplication of earlier-learned content. This problem is 
particularly acute at the times of transition from elementary school to high 
school and, even more so, from high school to college, because at these 
steps there is ordinarily a break in the continuity of supervision and a 
change in administrative control. The problem of articulation has to do 
with techniques and procedures for bridging these transitional points so 
that there is continuity in the educational program and so that each student 
is enrolled in courses which are consistent with his interests and level of 
academic proficiency. 

Various factors tend to interfere with good articulation between ele- 
mentary and high school and between high school and college. The wide 
variability in standards of achievement and in level of talent from school 
to school, for example, tends to handicap good articulation; a certificate 
of graduation at a certain educational level is no guarantee that any fixed 
standard of proficiency has been met. Another factor is the increasing 
flexibility of curricular programs; course requirements tend to be less 
rigorous than formerly, and in some schools superior students may be 
encouraged to advance far ahead of their classmates. Finally, some school 
administrators fail to realize the problems of articulation and consequently 
make small effort to solve them. 

That the problem of articulation is real and far from solution is shown 
by several studies made by educators. Such studies show, for example, that 
a large proportion of the content of some freshman college courses is a 
duplication of high school courses; that a fair proportion of high school 
seniors may surpass the median of college sophomores or even seniors 
in certain achievement test scores; and that superior high school students 
are frequently able to show good proficiency before taking certain ele- 
mentary college subjects. It has been shown that superior students who 
were admitted to college at the end of the third year of secondary school 
did as well as students with four years of such schooling. These and many 
other studies indicate that there is often considerable inefficiency and 
wasted effort in the educational program, particularly for the superior 
group. No less serious is the converse of this problem; many students are 
placed in high school or college courses for which they lack the basic 
preparation. The growing tendency to institute special remedial courses 
in mathematics and reading at the college level is symptomatic of this situation.

There are at least two important aspects to the problem of articulation. 
One pertains primarily to curriculum planning, to the proper organization 
and coordination of the courses of study at the school levels involved. 


This topic has been discussed in chapters 1 and 2. The other aspect, per- 
taining to the proper placement of each individual student, is the major 
concern of the present section. 

Use of Tests in Improving Articulation 

First of all, the improvement of articulation both between elementary
and secondary school programs and between secondary school and college
programs depends upon adequate guidance facilities. Since a counselor
or faculty adviser cannot perform his duties effectively without adequate
information about not only the requirements of the various courses but
also the aptitude, level of proficiency, and interests of each student, he
should have this latter information available to him through two sources:
educational records and tests, both general and differential.

Within school systems where there is coordination of administrative 
control, the problem can be handled fairly well through educational 
records, provided that grades in specific subjects are based on reliable tests 
of achievement. The Educational Records Bureau has contributed to the 
solution of this problem by cooperating with schools and school systems 
in planning record systems and conducting testing programs. 

Frequently, however, records furnish an inadequate solution because 
schools have varying standards, curriculums, methods of instruction, and 
grading systems. This is particularly true in articulation between secondary 
school and college, where there is seldom any continuity in administra-
tion, and students may come from schools widely scattered throughout 
a state or even the nation. In such a situation the use of special tests of 
aptitude and achievement is particularly useful in improving articulation. 

Articulation of secondary school and college programs

The tests used in improving articulation between secondary school and
college may be administered either near the time of graduation from school
or after admission to college; the first method has the advantage of making
the test data available for use in admissions procedures as well as in
articulation.

The Regents Examinations in New York State, originated in 1877, are 
the oldest state-wide achievement testing program in existence. Examina- 
tions are given to high school seniors throughout the state in a wide variety 
of subjects ranging from English, American history, and French, to music, 
art, typewriting, and shorthand. Scores on these tests when reported to the 
college of the student's choice furnish valuable data for the admissions 
officer and also for advisers who are concerned with the problem of specific 
course assignments. Careful use of such information can help to avoid the 


many academic difficulties caused by placing students in courses for which 
they are either over- or under-prepared. 

The aptitude and achievement examinations of the College Entrance 
Examination Board are also administered near the end of the senior year 
in high school and thus provide colleges with information which can be 
used to articulate the college freshman program with the high school 
courses of incoming students. To make these or any other such tests maxi- 
mally useful, each college employing them must conduct local studies to 
establish standards for admission to particular courses. Depending on the 
nature of the course, such a study may take the form either of predicting 
what the scores mean in terms of subsequent success or of simply establish- 
ing norms on known populations. At Harvard, for example, it has been 
found appropriate to determine course placement in freshman mathematics 
by predicting the student's probable success from his C.E.E.B. mathematical 
aptitude score. In various language courses, however, definite test norms 
have been established by administering the tests to students who are on the 
point of actually completing the courses at different levels. Course place- 
ment in this latter situation is achieved by comparing the test performance 
of the incoming students with the various norms so established. 

The distinction between these two approaches to the problem may be 
important. In the case represented by mathematics, the background of all 
incoming students is roughly comparable, so that the aptitude score is, 
in a general way, a measure of the facility with which the student can 
handle quantitative and symbolic concepts. In a sense it provides a measure 
of the slope of the student's learning curve on the subject, and success in 
freshman mathematics appears to be predictable from a knowledge of this 
slope regardless of the actual position on the curve. In other words, the 
decision is between a "slow" course and a "fast" one. In the case of the 
languages, however, it appears that the best placement is made from a 
knowledge of the student's position on the learning curve. The decision 
depends on whether the student has learned enough in secondary school 
to enable him to cope with the content of a particular course. 
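The distinction between the two placement procedures can be sketched schematically. In the sketch below, all scores, cutoffs, and course names are invented for illustration; they do not represent the actual Harvard or C.E.E.B. values, which the text does not give.

```python
# Two hypothetical placement rules, mirroring the distinction above.
# All numbers and course names are invented for illustration.

def place_by_prediction(aptitude_score, slope_cutoff=550):
    """Mathematics-style placement: predict probable success from an
    aptitude score (the 'slope' of the learning curve) and choose
    between a fast and a slow section."""
    return "fast section" if aptitude_score >= slope_cutoff else "slow section"

def place_by_norms(achievement_score, course_norms):
    """Language-style placement: compare the student's achievement score
    with norms established on students completing each course level, and
    place him in the first course whose completion norm he has not met."""
    for course, norm in sorted(course_norms.items(), key=lambda kv: kv[1]):
        if achievement_score < norm:
            return course
    return "advanced placement"

norms = {"French 1": 30, "French 2": 55, "French 3": 75}  # invented norms
print(place_by_prediction(600))   # -> fast section
print(place_by_norms(60, norms))  # -> French 3
```

The first rule uses only the rate-of-learning measure, regardless of position on the curve; the second uses only the position, by comparison with the norms established at each course level.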

Various other regional high school testing programs are in progress, 
such as those of Iowa, Minnesota, and Illinois, and the scores are being 
put to use by both schools and colleges within the state. 

One objection to such testing programs, whether on a state-wide or
national scale, is that a college adviser cannot count on each of his group 
having taken the tests which are appropriate to his individual problems. 
The student may have escaped the test because he came from a region 
where such tests were not used or, in the case of the C.E.E.B. examinations, 
he may have made an inappropriate selection. These annoying irregularities 


can largely be overcome by setting up a freshman testing program at the 
college. If such a program is carefully planned and tests rapidly scored 
and reported on, pertinent information for practically every freshman can 
be in the hands of each adviser prior to the time of registration. Such a 
program, however, obviously can serve only to improve articulation, and 
cannot help in admissions work. 

The Cooperative Test Division of the Educational Testing Service, pre- 
viously mentioned, provides a large number of achievement tests which 
are appropriate for placement purposes, and in its annual National College 
Freshman Testing Program recommends a battery of tests to be adminis- 
tered to the entire class in an institution. Also available for this purpose 
are the Iowa Placement Examinations, which have been rather widely used. 

The University of Chicago has developed its own set of College Place- 
ment Tests, prepared by its Board of Examiners. These tests cover various 
fields such as American history and government, humanities, physical 
sciences, mathematics, English, and foreign languages. The scores obtained 
by an individual on various pertinent tests are used in assigning him to 
particular courses. The tests are divided into Common and Special place- 
ment tests, of which all students take the former. If the student's perform- 
ance is sufficiently high, he is excused from certain first course require- 
ments, and students with unusually high scores are invited to the Special 
placement tests. If the student's performance on the Special test is suffi- 
ciently high, he is excused from certain advanced courses. Theoretically, it
is possible for a student to meet all the requirements of the College of the 
University of Chicago through the passing of placement tests. The division 
of the tests into Common and Special insures that only very able students 
will need to take more than twelve hours of placement tests. 
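The Common/Special sequence just described can be sketched as a simple decision rule. The cutoff scores below are hypothetical, since the text does not give the University of Chicago's actual standards.

```python
# Hypothetical sketch of the Common/Special placement sequence described
# above; the cutoff scores are invented, not Chicago's actual standards.

def chicago_placement(common_score, special_score=None,
                      common_cutoff=80, invite_cutoff=90, special_cutoff=85):
    """All students take the Common test. A sufficiently high Common
    score excuses certain first-course requirements; an unusually high
    score earns an invitation to the Special test, which in turn can
    excuse certain advanced courses."""
    result = {
        "first_courses_excused": common_score >= common_cutoff,
        "invited_to_special": common_score >= invite_cutoff,
        "advanced_courses_excused": False,
    }
    if result["invited_to_special"] and special_score is not None:
        result["advanced_courses_excused"] = special_score >= special_cutoff
    return result
```

On these invented cutoffs, a student scoring 95 on the Common test and 88 on the Special test would be excused from both the first-course requirements and certain advanced courses, in line with the sequence the text describes.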

Articulation of elementary and secondary school programs 

While the problems of articulation at this level are essentially the same as 
those between high school and college, it is often administratively easier 
to provide the basic achievement test data here, because elementary schools 
and high schools are usually operated as part of the same public school 
system. It is thus easier to organize a unified procedure of achievement 
testing and record-keeping and to see that adequate elementary school 
records accompany each student to his secondary school. 

A number of commercially published achievement tests are available at 
this level. Among the best known and most widely used may be mentioned 
the Iowa Every-Pupil Tests of Basic Skills, the Metropolitan Achievement 
Test, Modern School Achievement Tests, the Progressive Achievement 
Test, and the Stanford Achievement Test. 


Articulation within the school 

The problem of articulation is, of course, not confined to the period of 
transition from one school to another; it is also essential that, as the student 
progresses from grade to grade within a school, he be enrolled in courses 
which are as well fitted as possible to his particular level of proficiency and 
to his educational-vocational objectives. The problems here, at least for the 
elementary school, largely involve the relation between the curriculum and 
the maturity of the children. 

Initially the child's educational level is established for the most part 
by his age. As he progresses up the educational ladder, it may become 
apparent that he is placed in a grade which is either too advanced or too 
retarded for his development. Some children are intellectually qualified to 
work at a more advanced level than most of their classmates; others, at a 
lower level. Readjustments may be made as a consequence of such situa- 
tions. The usual expedients are nonpromotion and "skipping" (double- 
promotion) or acceleration. The limitations of these procedures are pre- 
sented in chapter 1. 

The problem is, of course, not as simple as is implied in the foregoing 
paragraph. Factors other than intellectual maturity, such as social and 
physical development, must be taken into account, and the psychological 
repercussions of such a drastic step as failure to promote must be weighed. 
Even "intellectual" maturation may be uneven, so that a pupil is qualified 
for more advanced work in some respects only; for example, he may be 
exceptionally proficient in arithmetic but retarded in reading. Procedures 
for providing for individual needs in heterogeneous groups, as treated in 
chapter 1, should probably receive greater emphasis in the schools. Suitable 
tests can make important contributions to the solution of such problems. 
The next topic, "Use of Measurement in Classification," deals with some
particular aspects of the problems involved in attempting to place pupils in 
the educational environment which is best adapted to their degree of 
growth and development. 

Use of Measurement in Classification 

In recognition of the wide individual differences among pupils in learn- 
ing ability, adequacy of preparation, interests, industry, social maturity, 
and so on, the attempt is often made to segregate pupils into more or less 
homogeneous groups so that group methods of teaching may be effective. 
The simplest of these attempts places pupils in separate school grades solely 
on the basis of age. Perfect homogeneity is impossible to attain and, as a 
matter of fact, would probably be undesirable; but some degree of homo- 


geneity beyond mere age classification is generally agreed to be desirable. 
A teacher finds it impossible in a classroom situation simultaneously to 
interest and encourage the bright students and to prod the dullards up to 
some minimum standard of achievement. The attempt to place school
children in homogeneous groups is referred to as "classification" or, some-
times, "sectioning" (especially in the upper grades). The purpose of
classification is to reduce the range of individual differences in instructional
groups and thus make possible a type of instruction and educational en-
vironment more nearly suited to the needs of each individual student.
Since classification has already been discussed in chapter 1 in reference to 
the improvement of the learning situation, it is unnecessary to discuss 
further its advantages and limitations for this purpose. 

Classification has been based on various types of criteria, used singly
or in various combinations. The most appropriate criteria are those which
give the best prediction of achievement. The types of measures in common
use (other than age) include records of achievement in previous grades,
tests of general mental ability, achievement tests, and reading readiness
tests.

The type of measure to be used depends somewhat on the organization 
of the school. If pupils are to be classified into groups which remain the 
same for all parts of the curriculum, general measures are appropriate; 
while if pupils are classified differently for different courses, such as read- 
ing and arithmetic, then more specific types of measures are indicated.

The predictive value of tests of general mental ability is fairly high for
average achievement and achievement in subjects such as reading and arith-
metic. They are far less successful in predicting proficiency in spelling and
handwriting. Achievement tests and school grades are successful in about
the same degree as mental ability tests in predicting later achievement.
These general statements are based on a large number of separate studies, 
the results of which are by no means uniform. It would be necessary for 
each school to study the predictive value of various measures and to use 
those which are most successful with its own methods of instruction and 
techniques of appraising achievement. 

Reading readiness tests are widely used in the early grades to determine
whether the child's perceptual processes are sufficiently mature to enable 
him to make the fine visual discriminations that are necessary in the transi- 
tion from speech to written symbols. The Gates Reading Readiness Tests,
which are among the most widely used, require the child to perform such 
tasks as finding which two of four printed words are the same, finding 
which one of four printed words is the same as the one shown on a card 
by the examiner, naming letters and numbers, and finding the picture in a 


set the name of which sounds almost like the name of another picture.
Such tests tend to correlate about .50 with mental age. Reading readiness 
tests have been found to be the best predictors of achievement in reading 
and have been used as one basis for determining when the teaching of 
reading should be begun for the individual child. 

Even with the most careful attempts to bring together for instruction 
students who are homogeneous with respect to their ability to learn a 
particular kind of material, a good deal of variability is bound to be present. 
Individuals are such complex creatures that, even with the best measures 
so far created, it will not be possible to discover students who are similar 
in all aspects of motivation, interests, skills, and scholastic aptitude. On 
the other hand, by proper use of suitable tests it is possible to decrease 
significantly the variability with respect to particular abilities and skills that 
are found to be important in the teaching of a certain subject. 
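The point about reduced variability can be shown with a small numerical sketch. The scores below are invented: sectioning pupils by ranked test score leaves each instructional group with a score spread well below that of the unsectioned class.

```python
# Invented illustration of the statistical point above: grouping pupils
# by ranked test score leaves each section far more homogeneous than
# the class as a whole.
import statistics

scores = [48, 52, 55, 60, 63, 67, 70, 74, 78, 81, 85, 90]  # hypothetical class

# Form three equal-sized sections from the ranked scores.
ranked = sorted(scores)
n = len(ranked) // 3
sections = [ranked[:n], ranked[n:2 * n], ranked[2 * n:]]

whole_sd = statistics.pstdev(scores)                    # spread of whole class
section_sds = [statistics.pstdev(s) for s in sections]  # spread per section

# Each section's standard deviation is less than half that of the class.
assert all(sd < whole_sd / 2 for sd in section_sds)
```

Of course, as the text cautions, this reduction applies only to the ability measured; the sections would remain heterogeneous in motivation, interests, and other unmeasured respects.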

It is, of course, possible to classify pupils on the basis of their achieve- 
ment during a trial period of, say, three to four weeks. The advantage of 
tests is that the trial period may be eliminated, thus avoiding the feelings 
of insecurity which frequent changes of class are likely to produce in chil- 
dren, and permitting a quicker adjustment of both teacher and pupils to the 
classroom situation. 

This chapter has discussed problems related to admissions, articulation, 
and classification for which testing techniques have already offered some 
partial solutions, and has indicated other problems where equal success may 
be encountered in years to come. In all such applications, the limitations of 
tests and of measuring devices, as well as their potentialities, must be con-
sidered. As research on educational measurement is continued by universi- 
ties and testing organizations, better solutions to the problems yet unsolved 
will doubtless be forthcoming. In the meantime, the importance of an 
experimental approach by each educational institution to its own problems 
of selection and placement by whatever means are available cannot be 
overemphasized. All educational problems cannot be settled by the large 
research organizations; the methods of making optimum use of educational 
and psychological measurements must ultimately be discovered by each 
school in the light of its own particular situation. 

Selected References 

1. Bengtson, Nels A. Annual Report of the University Junior Division. Lincoln:
University of Nebraska, 1941. 

2. Buros, Oscar K. (ed.). The Third Mental Measurements Yearbook. New Brunswick,
N.J.: Rutgers University Press, 1949.


3. Crawford, Albert B., and Burnham, Paul S. Forecasting College Achievement. 
New Haven: Yale University Press, 1946. 

4. Donahue, Wilma T.; Coombs, Clyde H.; and Travers, Robert M. W. Measure- 
ment of Student Adjustment and Achievement. Ann Arbor: University of Michigan 
Press, 1949. 

5. Dyer, Henry S. "Evidence on the Validity of the Armed Forces Institute Tests of 
General Educational Development (College Level)," Educational and Psychological 
Measurement, 5: 321-33, 1945. 

6. Findley, Warren G. "A Group Testing Program for the Modern School," Educa- 
tional and Psychological Measurement, 5: 173-79, 1945.

7. Frederiksen, Norman. "Predicting Mathematics Grades of Veteran and Non-Veteran
Students," Educational and Psychological Measurement, 9: 73-88, 1949. 

8. Garrett, Alfred B. "Giving College Credit in Chemistry by Examination," Ohio 
Schools, 24: 556-57, 1946.

9. Learned, William S. Examinations and Education. Carnegie Foundation for the Ad- 
vancement of Teaching, Forty-First Annual Report, 1945-1946. New York: The Foun- 
dation, 1946. 

10. Learned, William S., and Wood, Ben D. The Student and His Knowledge. ("Carnegie
Foundation for the Advancement of Teaching Bulletin," No. 29.) New York: The 
Foundation, 1938. 

11. Munroe, Ruth Learned. Prediction of the Adjustment and Academic Performance of 
College Students by a Modification of the Rorschach Method. ("Applied Psychology 
Monographs," No. 7.) Stanford, Calif.: Stanford University Press, 1945. 

12. Ross, Clay C. Measurement in Today's Schools. 2nd ed. New York: Prentice-Hall,

13. Ryans, David G. "A Study of the Observed Relationship between Persistence Test 
Results, Intelligence Indices, and Academic Success," Journal of Educational Psy- 
chology, 29: 573-80, 1938. 

14. Schrader, William B., and Conrad, Herbert S. "Tests and Measurements," Review
of Educational Research, 18: 448-68, 1948. 

15. Segel, David. Prediction of Success in College. ("U.S. Office of Education Bulletin," 
1934, No. 15.) 

16. Traxler, Arthur E. Techniques of Guidance. New York: Harper & Bros., 1945.

17. Wood, Ben D., and Haefner, Ralph. Measuring and Guiding Individual Growth. 
New York: Silver Burdett Co., 1948. 

Part Two 

5. Preliminary Considerations in Objective 
Test Construction 


State University of Iowa

Collaborators: Henry Chauncey, Educational Testing Service; Walter 
W. Cook, University of Minnesota; Warren G. Findley, Educational 
Testing Service; T. R. McConnell, University of Minnesota; Ralph W.
Tyler, University of Chicago; Ben D. Wood, Columbia University

For the purposes of this volume, the construction of an 
educational achievement test will be regarded as consisting of five major 
steps, as follows:

1. Planning the test 

2. Writing the test exercises 

3. Trying out the test in preliminary form and assembling the finished 
test after tryout 

4. Determining the procedures and preparing the manuals for admin- 
istering and scoring the test 

5. Reproducing the test and accessory materials 

Each of these steps will be separately considered, so far as the objective 
type of test is concerned, in chapters 6-11. Problems of test construction 
relatively unique to performance and essay tests will be considered in chap- 
ters 12 and 13 respectively. The scaling of the test and the establishment 
of norms, which, in a sense, are also a part of the task of test construction, 
will be discussed in Part Three, "Measurement Theory." 

This analysis of the steps involved in test construction takes for granted 
that certain preliminary decisions have already been made concerning the 
general nature and purposes of the test to be constructed. Preliminary 
to actual test construction, the author has presumably selected, or specified 
in broad general terms, the educational objectives the attainment of which 
is to be measured by the test. Presumably, he has also specified, at least in 
general terms, the population of individuals with which the test is intended 
to be employed, the conditions under which the test will be administered 
and used, and the major uses to which the test results are intended to be 
put. For example, the test author may already have decided that his test 



is to measure the ability of high school students to use the English language 
correctly in their own writing, and that the test is intended to serve pri- 
marily as a basis for educational guidance of high school students and for 
the evaluation of high school instruction in this area. Again, the test con- 
structor* may have decided that his test is to determine how well the stu-
dents in a particular course in United States history at the ninth-grade 
level have attained the immediate objectives of instruction in this course, 
with the intention that the test will be used as the principal basis for as- 
signing course grades and determining whether the student has passed or 
failed the course. 

It should be apparent that the decisions which are made preliminary to 
actual test construction are, from the broadest point of view, far more im- 
portant or crucial than those which follow. The contributions of educa- 
tional measurement to the educational process as a whole, and to its im- 
provement, depend as much or more upon the ability of test constructors 
to recognize the situations in which tests are most needed or can do the 
most good, or upon their ability to anticipate the good and bad effects of 
the projected tests upon educational practices, as they do upon the technical 
competence of the test constructors. It is, of course, important that whatever 
tests are constructed measure validly and dependably whatever they do at- 
tempt to measure, but it is even more important that what they do attempt 
to measure be worth while and significant, and that through their use the 
tests and test results exercise a desirable influence upon the aims, habits, 
attitudes, and achievements of students, teachers, counselors, and school
administrators.

The major purpose of this chapter is to review and evaluate the broad 
trend of these preliminary decisions in the recent history of educational 
achievement testing, and to consider the possible need for a fundamental 
redirection of these preliminary decisions in the future. A second purpose 
of the chapter is to consider briefly the various possible "approaches" 
which the test constructor may take to the measurement of educational 
achievement in general. Upon considering the difficulties of measuring the 
objective that he has tentatively selected, the test constructor may often 
decide to forego its measurement entirely, and may turn instead to other 
more easily tested outcomes. The second purpose is thus very closely re- 
lated to the first. The chapter as a whole, then, is basically concerned with 
the problem of deciding what to measure, but serves as well to introduce 

* For convenience in this discussion, the term "test author" or "test constructor" will be 
used in place of the more appropriate term "test construction agency." In modern practice, 
the "author" of a standardized test is usually a team of individuals, including specialists 
in subject matter as well as in the techniques of test construction. 


the problem of how to measure, with which the remaining chapters of
Part Two are concerned. 

The Selection of Objectives 

Ultimate vs. Immediate Objectives 

Many of the basic objectives of school instruction cannot possibly be 
fully realized until long after the instruction has been concluded. For 
guidance in specific courses of instruction, however, it is common practice 
to set up less remote objectives — objectives which are capable of immediate 
attainment. Ideally, these immediate objectives should in every instance 
have been clearly and logically derived from accepted ultimate objectives, in 
full consideration of all relevant characteristics of the pupils who are to 
receive the instruction. Ideally, also, the immediate objectives should be 
supported by dependable empirical evidence that their attainment will 
eventually lead to or make possible the realization of the ultimate objectives. 
Finally, the content and methods of instruction should, ideally, be logi- 
cally selected, devised, and used with specific reference to these immediate 
and ultimate objectives, and should likewise be supported by convincing 
experimental evidence of their validity. 

Unfortunately, this ideal relationship among ultimate objectives, im- 
mediate objectives, and the content and methods of instruction has only 
rarely been approximated in actual practice. Some of the content of current 
instruction, if derived at all from sound and accepted ultimate objectives, 
has been derived from them by a process of faulty inference, and contrib- 
utes much less to the realization of the objectives than other content which 
could be substituted for it. More unfortunately, a portion of the present 
content of school instruction is there only by reason of the organization of 
the curriculum by "subjects," and because of the practice of introducing 
new materials in intact subject units, or subject by subject, often without 
any careful selection of the detailed content of those subjects. As a result 
of this practice many detailed elements which have no relationship what- 
ever to any ultimate objectives have entered the curriculum simply because 
they "belonged" in the same broad category of knowledge, or in the same 
subject, with other content which could be readily justified, and because of 
which the subject as a whole was selected. A considerable portion of the 
factual content of current instruction in courses in United States history, 
for example, is of this character. This is not to say that the place of this 
subject in the curriculum cannot be as readily justified as any other. Knowl- 
edge of certain facts in American history undoubtedly contributes to the 


realization of certain important educational objectives. It does not follow 
from this, however, that any historical information has a place in the cur- 
riculum. It is not the teaching of United States history per se which contrib- 
utes to the objectives, but the appropriate use in instruction of certain 
specific historical materials which have been carefully selected with these ob- 
jectives in mind. Nevertheless, many historical facts are included in the 
school textbooks for apparently no better reason than that they are of interest 
to historians or that they are a part of "American history," and many other- 
wise useful facts are taught in such a way as to minimize their contribution 
to desirable educational objectives. History courses are not, of course, 
alone in this respect. Similar statements, with more or less justification, may 
be made about every subject in the school curriculum. 

As a result of this organization by subjects, furthermore, many other 
elements of content may survive in the curriculum long after their 
usefulness has gone, simply because they "belong" to a subject which is 
still, on the whole, of unquestioned value, or which retains its place in 
the curriculum, in spite of its questionable value, by reason of tradition, 
inertia, and the power of vested interests. 

With reference to subjects of the type last noted, it is frequently true 
that, instead of the content and immediate objectives of instruction having 
been derived from any accepted ultimate objectives, just the reverse of 
this relationship obtains. That is, many of the immediate objectives were 
actually derived from, or adapted to, the traditional content; and their 
claimed relationships to ultimate objectives — as well, sometimes, as the 
ultimate objectives themselves — were "thought up" in an effort to rational- 
ize the continued teaching of the traditional content. It has been claimed, 
for example, that knowledge of Latin will contribute to improved reading 
comprehension in the vernacular, through developing a better understanding 
of English words and phrases of Latin origin, or through developing a more 
accurate grammatical sense, and so forth. While there may be some truth 
in these claims, they are clearly made in an effort to justify the perpetua- 
tion of Latin in the school curriculum, rather than to justify its selection in 
preference to all known alternative and more direct ways of improving 
English reading comprehension. Again, of course, Latin has been used 
only as a convenient example. For any subject in the school curriculum, 
particularly for the long-established subjects, many of the immediate ob- 
jectives claimed for them have a similar origin. Indeed, one of the most 
common of all real objectives in teaching, in general, is "to teach the facts
contained in the textbook." Such objectives as "to acquire a sound knowl-
edge of world geography," "to know the important facts of United States 
history," "to understand common natural phenomena," are often only an- 


other way of saying, "to know what is in the text." The really functional 
objectives of many school subjects — the day-by-day objectives that most 
teachers are actually trying to attain — are, in large part, content objectives 
of this type. 

The foregoing is not intended as a general indictment of the practice 
of organizing the curriculum by subjects. The practice clearly has its ad- 
vantages as well as disadvantages, and many of its undesirable features 
could be eliminated while retaining the advantages. Whatever one does 
about curriculum organization, there will always be the problem of deriving 
immediate objectives from those more remote or fundamental and of allo- 
cating objectives to appropriate units of instruction. Furthermore, while 
"subjects" may remain the same in name, they may be, and frequently are, 
changed in content and method so that certain ultimate objectives may be 
more effectively realized. Our concern in this discussion, then, is only with 
the implication of the present subject organizations and content of the cur- 
riculum for achievement testing, and not with the pros and cons of sub- 
ject organization in general. 

Immediate Objectives of Traditional Subjects in Achievement Testing 

The situation just reviewed is one of particularly serious concern with 
reference to educational achievement testing. Practically all of the standard- 
ized achievement tests and informal school examinations that have been 
constructed to date have been designed to measure achievement in estab- 
lished school subjects. At the high school and college levels, particularly, 
standardized achievement tests that are intended to measure the attainment 
of general objectives, or that disregard or cut across subject-matter bound- 
aries, have constituted only a very small proportion of the total offering. 
Furthermore, these tests, almost without exception, have been based on 
the actually functioning (as opposed to the claimed or theoretical) immedi- 
ate objectives of instruction — often with little regard to the possible lack 
of relationship of these immediate objectives to any important ultimate ob- 
jectives of the entire educational program. Where no authoritative state- 
ments of immediate objectives have been available, or where such objec- 
tives have not been sufficiently specific or meaningful, the test constructor 
has often set up or derived his own immediate objectives for the test. He 
has usually derived these objectives, however, not from any general ulti- 
mate objectives, but from the common content of current instruction. Analy- 
sis of textbooks and courses of study has constituted the most common tech- 
nique for validating educational achievement tests. Accordingly, just as the 
real objective of instruction has been "to teach what is in the textbook," so, 
in many instances, the real objective in testing has, to an even greater de- 
gree, been "to test for recall of what is in the textbooks," or "to test what 
is now being taught." 

Even where the prevailing immediate objectives of instruction have been 
most seriously questioned, test builders have continued, for purposes of 
test construction, to take these immediate objectives for granted. Achieve- 
ment tests for high school Latin, for example, have been concerned ex- 
clusively with the students' ability to read or write Latin, and their builders 
have made no pretense whatever, in these tests, of measuring the extent to 
which this ability contributes to the realization of any of the basic objectives 
of the whole program of general education. Tests in high school mathe- 
matics similarly have measured such things as the ability to factor poly- 
nomials, with little regard to the probable social utility of such skills, etc. 

On the whole, then, efforts of test builders to improve educational 
achievement tests have been largely confined within the framework of the 
prevailing subject-matter organization of the curriculum and of the gener- 
ally accepted immediate objectives of instruction. Within these restrictions 
a vast amount of progress has been made, but it has consisted primarily of 
the introduction of further technical improvements and refinements, pro- 
viding for increased comparability of scores from test to test, for increased 
reliability and efficiency in measurement, for better coverage or sampling 
of the content of instruction, for more representative and more reliable 
norms, and for improved administrative and scoring procedures. The im- 
provements that have thus been made, however, have been real improve- 
ments only in relation to certain relatively narrow classroom uses of tests 
(which will be discussed later). The greater opportunity, that of providing 
better tests for the purposes of individual student guidance and of curricu- 
lum evaluation in relation to a more enlightened concept of general educa- 
tional objectives, has been seriously neglected. 

As has just been implied, achievement test builders have not been with- 
out considerable justification for concerning themselves so exclusively with 
the construction of subject tests based on prevailing immediate objectives. 
Most achievement test builders will readily grant the truth of the preceding 
comments, at least so far as the typical school is concerned, and are quick to 
recognize the need for tests and other instruments that may be used in 
curriculum evaluation and in guidance. Many of them would contend, 
however, that a test cannot be highly valid for purposes of curriculum 
evaluation and at the same time also be appropriate for most of the uses 
to which achievement tests are generally put. That is, many of them would 
assert that there is a need for two distinct types of tests. While recognizing 
the need for evaluation instruments, they have felt that the construction of 
such instruments involves taking a responsibility for curriculum modifica- 
tion that they, as test technicians, are unwilling to assume. In other words, 
many test builders have taken the position that, while it is their responsibil- 
ity to keep pace in their tests with accomplished curriculum changes, it is 
not their business to bring about changes in the curriculum. Their position, 
further, would be that as long as large numbers of teachers are teaching a 
subject in accordance with certain generally accepted immediate objectives, 
those teachers have a right to know how well they have accomplished those 
objectives — no matter how far those objectives may be out of line with the 
most advanced thinking concerning more remote objectives, or how little 
the subject as a whole really deserves its present place in the curriculum. 

Most test users — that is, most teachers of special subjects — would readily 
subscribe to the latter point of view. Teachers of special subjects are not, 
in general, held personally responsible for the place of those subjects in 
the curriculum of their school, nor even, in most instances, for the immedi- 
ate objectives and detailed content of those subjects. They are given certain 
subjects to teach, and often the content is prescribed for them in detail 
(that is, the textbooks have been selected for them). They are, however, 
held personally responsible for "getting the content across" to their pupils, 
and all too often the results of subject tests have been used to hold them 
thus responsible. Accordingly, the more exclusively the tests used are con- 
cerned with that which the teachers have actually been trying to accomplish, 
and the less the tests are concerned with things for which the teachers do 
not feel responsible, the more acceptable the tests are to them. 

Most teachers, for these reasons, prefer standardized tests that differ 
from their own informal examinations only in that they are provided with 
norms, and that they are more reliable, more objective, more easily scored, 
and more highly refined technically, than their own product. In his own in- 
formal examining, the typical teacher has favored questions that can be 
passed only by students who have taken his particular course, and who are 
directly and intimately acquainted with its unique organization and content. 
Questions that can be answered on the basis of general information, or 
which many of the students could have answered before taking the course, 
or to which they could have learned the answers outside of class, are for 
those reasons suspect and are excluded from the examination. Whether con- 
sciously or unconsciously, many teachers, in their own informal testing, 
have tried to build examinations that will discriminate as sharply as pos- 
sible between students who have taken and those who have not taken their 
particular courses. The result has often been, for example, that two equally 
good students taking courses of the same title but under different instructors 
(even in the same school) might each fail on the examination intended 
for the other student, even though each might make an A on the examina- 
tion intended for his own course. The standardized tests which are most 
closely adapted to the local curriculum in this sense are those most often 
selected and used by the subject teacher. 

Further justification for the practice of basing achievement tests only on 
prevailing immediate objectives may be readily adduced from the point of 
view of the individual student. Standardized tests are widely used in the 
assignment of school marks, in the measurement of school progress, for 
purposes of pupil motivation, and in the maintenance of standards. With 
reference to these uses it may be argued that the pupil is not responsible 
for instructional objectives, and that if he accomplishes well what he has 
been asked to accomplish he should be given credit and rewarded for it; 
that it would be unfair to penalize him for failing to learn something 
which he had neither been encouraged nor given an opportunity to learn; 
and that the use of the same test for purposes of curriculum evaluation and 
pupil motivation would only confuse the pupil rather than motivate him 
to greater effort. 

For the reasons given, tests of course achievement concerned primarily 
with current instructional objectives will continue to be constructed and 
used, as long as the subject organization of the curriculum is retained. 
Attention will later be given in this chapter (pages 138-40) to the ques- 
tion of how and by whom such tests may best be constructed. The only 
point to be made here is that such tests should no longer occupy so domi- 
nant a place in the whole educational scene as they have in the past, nor 
should they absorb nearly so much of the thought and effort of agencies 
constructing tests for wide-scale use. The arguments presented in the pre- 
ceding paragraphs do not constitute an adequate general justification for 
past practices in the construction and use of educational achievement ex- 
aminations. As a result of these practices, educational achievement tests 
and achievement testing in general have exhibited a number of very serious 
limitations, particularly in relation to guidance and evaluation, and have 
been incapable of measuring or of assuring adequate recognition of many 
of the most important aspects of the total educational development of the 
pupils. It will be shown in the following sections how these limitations have 
been due, in part, to the emphasis on content and on questionable immedi- 
ate objectives, and, in part, to the strict adherence to the subject-matter or- 
ganization of the curriculum. Consideration will then be given to the un- 
desirable consequences, with reference both to the improvement of instruction 
and of educational guidance, of the almost exclusive use of tests with 
these limitations. 

Limitations Due to the Emphasis on Content Objectives 

There is nothing objectionable about content objectives, as such, in 
instruction, so long as the content is used in a manner appropriate to the 
outcomes ultimately sought, and so long as its proper relation to the ulti- 
mate objectives is recognized and understood. There can, of course, be no 
instruction without specific instructional content, and one of the most im- 
mediate and clearly legitimate objectives of instruction is to help the stu- 
dent make the best use of this content. This does not mean that much of 
the content should be memorized for purposes of later recall. Much, if not 
most, of the detailed informational content of instruction in the social 
studies, for example, is presented to the student with little or no expecta- 
tion that he will be able to recall it in accurate detail long after the course of 
instruction has been concluded. The skillful teacher uses this content in 
order to develop broad and meaningful generalizations, to establish trends, 
to illustrate principles, to bring out relationships, to raise problems, to 
exemplify and explain procedures, to characterize institutions, to develop 
attitudes, to establish desirable habits and ways of thinking, to develop the 
ability to evaluate evidence, to develop a good sense of values and sound 
judgment, and so on. The teacher rightly demands temporary learning of 
some of this detailed content, but primarily in order that the content may 
be more effectively used for the aforementioned purposes during the course 
of instruction. It is the generalized outcomes which are expected to be 
permanent; it matters relatively little if the student soon forgets many 
of the detailed facts from which the generalizations were originally derived, 
or by means of which the generalized procedures, skills, habits, attitudes, 
and ways of thinking were originally developed. As a result of an effective 
course of instruction in United States history, for instance, an adult may 
have retained permanently a very adequate picture of, say, Colonial 
America, highly accurate in broad outline and in all essential particulars, 
and quite adequate as a basis for understanding contemporary American 
institutions, traditions, and ideals, and yet he may have forgotten nearly 
all of what he once knew in the way of exact and detailed information 
about specific events and personalities, about specific wars and campaigns, 
explorations and discoveries, congresses and conventions, legislative acts 
and reprisals, and so forth. He may, similarly, have acquired a good under- 
standing of the evolution of our present system of transportation, ex- 
pressed in generalities quite adequate for purposes of thinking about 
contemporary problems, and yet may have forgotten entirely such facts 
as the time of completion, the cost, and the statistics on use of the Erie 
Canal. More important, even though he has forgotten such detailed facts, 
he may have retained, with little loss, whatever contribution the study 
of these facts made to the enrichment of his vocabulary, to the estab- 
lishment of desirable attitudes toward democratic institutions, to his 
ability to do critical thinking about contemporary social, economic, and 
political problems, etc. To a considerable extent, then, memorization of 
detailed content is only a temporary and incidental outcome of instruction — 
a means to an end, rather than an end in itself. 

Closely related to the need for only temporary learning of much of the 
detailed content of the social studies is the wide variation permissible in 
the selection of the specific content used to attain the same generalized out- 
comes of instruction. Two instructors in the same subject, for example, 
may both wish to develop in their students a generalized understanding of 
the nature and purposes of American labor unions. Obviously, a very wide 
choice of illustrative materials is available for this purpose — far more is 
available than need be employed or can be employed in a single course of 
instruction. One instructor may thus decide to use one set of specific ex- 
amples, while the other may use an entirely different selection of instruc- 
tional materials, yet both may serve essentially the same ends. On a some- 
what higher level, different instructors may concern themselves with quite 
different ideas, generalizations, problems, and so forth, yet accomplish 
essentially the same purposes so far as still more remote objectives are 
concerned. One instructor, for example, may devote considerable attention 
to international trade relationships, a topic which the other may neglect 
in order to give adequate consideration to domestic financial problems and 
policies. One may stress certain national cultures and ideologies; another 
may select still others for intensive consideration. Yet the two courses may, 
for all general educational purposes, be completely interchangeable. Even 
different subjects may be regarded as interchangeable with reference to 
still more remote objectives, as is implied and recognized in the practice of 
allowing students to elect different high school and college courses in the 
same broad areas in meeting general educational requirements. 

Comments similar to the preceding may be made about any broad field of 
subject matter, particularly at the level of general education. Various in- 
structors in one of the natural sciences, for example, may all strive to 
develop in their students an understanding of the role of science in 
modern society, of the techniques and methods used by the scientists, and 
of the limitations as well as the possibilities of these methods in answering 
questions about the world. All may be interested in building up in their 
students the basic scientific vocabulary needed in reading and thinking 
about contemporary scientific developments, and all may wish to improve 
the ability of their students to observe scientific phenomena and to draw 
valid inferences from their observations. Yet, different instructors might 
use widely differing instructional content with equal success. As in the 
social studies, even different subjects may prove equally effective in rela- 
tion to such objectives. A course in physics may serve such purposes as well 
as would one in chemistry; a course in botany as well as would one in 
zoology. Again, the fact that the student is unable, months or years after- 
wards, to recall much of the detailed content of a course of instruction in 
science fortunately does not imply that the instruction was futile or that 
the course held no permanent values for him. 

In the field of literature, likewise, one instructor may acquaint his stu- 
dents with certain outstanding literary works, while the other may employ 
an entirely different selection of literary materials. Both sets of instruc- 
tional materials, however, may be used with equal effectiveness for the 
purpose of developing in the students a better understanding of their fel- 
low-beings, of their motives, ideals, and frailties. Both may be equally ef- 
fective in developing the ability to read literature with comprehension and 
enjoyment, or in developing improved tastes in the selection of reading 
materials, or in establishing desirable and lasting reading habits. There is 
probably no other field of instruction that permits wider latitude in the 
choice of instructional materials, or in which such freedom of choice is 
more desirable, than the field of literature. 

The foregoing is not meant to imply that the content type of examina- 
tion has no validity whatever, or that there is no legitimate place in educa- 
tional achievement testing for tests or test items that hold the student re- 
sponsible only for the recall or recognition of specific items of information. 
The case for the continued use of tests of this type rests primarily on two 
considerations. In the first place, some of the content of instruction in any 
subject is taught with the hope that it will be permanently retained by the 
student. No item of information, of course, is taught purely for its own 
sake — every bit of specific content in the course of study must ultimately 
be defended in terms of the number of uses that will later be made of it, 
either during, or subsequent to, formal instruction, and in terms of the 
importance or cruciality of these uses. Some content is useful only in ac- 
complishing certain immediate purposes in instruction, as in establishing 
a single generalization or in illustrating the application of a single principle. 
Such purposes, furthermore, may frequently be served equally well by 
many other similar items of information. Some content, on the other hand, 
is so widely and so obviously useful in adult living that, if it is retained at 
all, its later utilization may safely be taken for granted. Some of this content, 
furthermore, may be the only content that will serve these particular 
purposes. With such materials, one of the teacher's primary concerns 
should be with the learning, for indefinite retention, of the content itself. 
With such materials, also, the so-called content examination, granting that 
it does not test for rote memory only, is quite appropriate whether used 
during the course of instruction or at any subsequent time. There is, on this 
basis, a very definite place in educational achievement testing, both in sub- 
ject examinations and in tests of general educational development, for 
items of the informational type. In general, however, only a minor pro- 
portion of the total content of instruction is of this character, particularly 
in the subjects making up the program of general education at the high 
school and college levels. Only a relatively small proportion of the content 
items in the typical subject examination may be defended on these grounds. 

The case for the content examination rests, in the second place, upon the 
belief that even with content which is primarily of immediate instructional 
value only, there is some positive correlation between the student's ability 
to recall this content and the extent to which he has derived any general- 
ized values from it and will retain them permanently. Facts which have 
been well organized in the student's mind will probably be better remem- 
bered than those which are unorganized or unrelated. The formulation 
and retention of a generalization probably helps the student remember 
the facts used to establish it. Because of this incidental relationship, a con- 
tent examination may always have some validity as an indirect measure of 
the attainment of certain ultimate objectives. The relationship, however, 
has never been shown to be very high. Few instructors would be willing to 
have the effectiveness of their own instruction judged, on an absolute basis, 
by the performance of their students even on their own content examina- 
tions, if these examinations were "sprung" on their students several years 
after they had taken their courses, nor would many instructors care to have 
their students so judged, even on a relative basis, as to the extent to which 
they had profited individually from these courses. 

All fundamental objectives of education are ultimately concerned with 
the modification of behavior. The extent to which any individual is genu- 
inely educated depends not upon what he knows nor upon the amount of 
information that he has acquired, but upon what he is able to do. For the 
reasons just noted, there is, nevertheless, some positive correlation between 
the extent of the student's knowledge and his true educational status. It is 
not surprising, in view of the ease and reliability with which the extent of 
the student's knowledge may be determined, that the earliest efforts to 
objectify the measurement of educational achievement relied so strongly 
on this relationship — that is, that the first objective tests were almost ex- 
clusively of the content type. This approach to the measurement of educa- 
tional achievement, or, particularly, of the attainment of ultimate objectives, 
has already been exploited to the full. If it represented the only available 
approach, the limits of improvability of educational achievement tests would 
already have been reached. Fortunately, this is not the case. The opportuni- 
ties for further improvement are unlimited, but they lie in the direction 
of more direct measurement of ultimate and general objectives, rather than 
of more reliable measurement of immediate and content objectives. 

One very important implication of these limitations of the content type 
of examination remains to be considered. It follows, both from the inter- 
changeability and the temporary value of most of the content of instruction, 
that the typical subject examination is particularly inappropriate or lacking 
in validity when used with individuals who not only have not recently taken 
a course of the same title, but who have never taken any such course. That 
is, the typical subject examination is most inappropriate when used with 
individuals whose educational development in the particular area covered 
by the test has been acquired from informal out-of-school experiences or 
incidentally through the study of other subjects, or when used with indi- 
viduals who are largely or wholly self-educated. 

It was the recognition of these limitations of available subject examina- 
tions which led, in part, to the construction of the United States Armed 
Forces Institute's Tests of General Educational Development. The authors 
of these tests were given the responsibility of selecting or constructing a 
battery of tests that could be used to determine the general educational 
status, or the appropriate placement in a program of general education, of 
war veterans returning to school at the close of World War II. It was ap- 
parent that many of these veterans would return to school much farther 
advanced educationally than when they had left school to enter the service — 
that while their formal education had been interrupted by the war, their 
education had continued, although perhaps usually at a considerably slower 
pace than had they remained in school. It was even more apparent that 
whatever educational development had taken place while they were in 
service would have come about in a markedly different fashion than if they 
had remained in school. Their in-service education would have been ob- 
tained, not through textbooks or classroom instruction, but through direct 
observation and experience enriched through travel, through the reading of 
newspapers, magazines, and books, through informal discussions and con- 
sideration of personal and group problems, as an incidental to specific mili- 
tary training, and through systematic study, either organized for them or 
self-initiated and self-directed. From these experiences, the veterans derived 
many of the same generalized values — of the nature reviewed in the pre- 
ceding paragraphs — that might otherwise have been acquired through 
formal school instruction, but quite obviously most of the "content" from 
which these values were derived not only differed markedly, in general, 
from the traditional content of formal school subjects, but also varied 
greatly from one individual to another. 

It was suggested early to the authors of the GED tests that the educa- 
tional status of the returning veteran might be satisfactorily determined 
through a carefully selected battery of available standardized tests cor- 
responding to the principal subjects in the typical curriculum, and that the 
veteran might be given "credit" for each subject in which his test perform- 
ance equaled that of civilian students who had actually taken and passed 
the course in question. Very little deliberation was required for the rejec- 
tion of this suggestion. It was clearly evident, for the reasons reviewed in 
the preceding paragraphs, that the use of even the best of available subject 
examinations would fail to reveal the veteran's true educational status, and 
would penalize him severely because of the manner in which his education 
had been acquired. Many veterans, for instance, who had been forced to 
leave high school before graduation had, nevertheless, through their in- 
service educational experiences, acquired the practical equivalent of a gen- 
eral high school education, and should on that basis have been granted a 
high school diploma or equivalency certificate. To have required those 
veterans to satisfy the usual school standards on a battery of subject exam- 
inations would have been just as unfair as to have required all veterans who 
had actually graduated from high school to pass such examinations again 
upon leaving the service in order to be permitted to retain their high school 
diplomas. What was needed, obviously, was a battery of tests which would 
provide far more direct measures of the ultimate outcomes of a general 
education than could be secured through any available subject examinations. 

Few educators will deny the shortcomings of the typical subject-matter 
examination in the situation just reviewed. Yet, if such examinations are 
inappropriate for measuring the general educational status of war veterans, 
they must necessarily be inappropriate for measuring the general educa- 
tional status of pupils who are yet in school. Opportunities for informal or 
incidental learning or for self-education are by no means open only to 
those who are no longer in school. On the contrary, a very considerable 
share of the educational development of pupils at any level of education 
may be attributed to out-of-class experiences. In the case of many of the 
more intelligent, intellectually curious, observant, and well-read youngsters, 
particularly those with favorable home and community environments, opportunities 
for summer travel, and rich job experiences, and so forth, it is 
quite possible that the major share of their total educational growth has 
involved experiences outside of the classroom. With such individuals, as 
much as with war veterans or with self-educated adults in general, any 
collection of content or subject-matter examinations is certain to fall far 
short of revealing their true educational status. 

In concluding this description of the limitations of the content exam- 
ination, it may be well to note that much of what has just been said about 
these limitations would still apply, even though those examinations had 
been based entirely upon the best possible content, that is, upon content 
which, for purposes of achieving the desired ultimate outcomes in specific 
courses of instruction, was as well selected and as appropriate to these 
purposes as any content could possibly be. In consideration, therefore, of 
the fact that much of the content on which these examinations had typically 
been based in the past was not thus carefully selected, but was included in 
the curriculum and in the examination primarily as a result of tradition and 
historical accident, the limitations of examinations based upon the prevail- 
ing content objectives become all the more evident. 

The preceding discussion has been concerned primarily with the limita- 
tions of subject examinations of the content type, that is, tests designed to 
measure the extent of the student's knowledge of particular school subjects. 
Not all immediate objectives of instruction, of course, are content objectives, 
nor are all subject examinations of the content type. Of the remaining 
immediate objectives, most are concerned with the development of specific 
skills and abilities. The situation in educational achievement testing with 
reference to objectives of the latter type is, in general, much more satis- 
factory than in the so-called content subjects. This is true, in part, because 
in these areas of instruction there is frequently a much more direct and 
obvious relationship than in the content subjects between the immediate 
objectives of instruction and the ultimate objectives of the whole program 
of education. In many cases, indeed, there is no clear distinction to be made 
between immediate and ultimate objectives, that is, the ultimate objectives 
are themselves immediately attainable. Many of the things which the pupil 
is taught to do in elementary school arithmetic, for example, are exactly the 
things it is hoped he will later be able to do as an adult. Thus, tests of 
computational skill and of problem-solving ability in arithmetic, if appro- 
priate for use during the course of instruction in arithmetic, should be 
equally appropriate for the measurement of those skills at any later time 
or in any other situation. A skill may deteriorate after it has once been 
taught, of course, either because of an inadequate maintenance program in 


subsequent instruction, or because it finds no application in life outside the 
school, but the validity of the test does not deteriorate subsequent to
instruction, as is true of many content examinations (see page 152).

The foregoing is not meant to suggest, of course, that all immediate 
instructional objectives of the skills or abilities types now generally taught 
may be readily defended, nor that all skills are taught which should be 
included in the program of general education. There are many high schools, 
for example, in which all pupils are required to learn how to divide higher 
order polynomials and to factor quadratic equations, as well as many schools 
in which pupils are given no specific instruction in such obviously useful 
skills as those required in map-reading or in the interpretation of statistical 
tables. Whatever the nature of the immediate objectives of instruction, the 
fact that they are currently prevalent in teaching is no safe indication that 
they constitute a valid basis for the construction either of subject examina- 
tions or of tests of general educational development. 

Limitations Due to the Subject Organization 
of the Curriculum

The modern American school accepts responsibility, in some degree, for 
every major phase of the child's development — intellectual, emotional, 
social, moral, and physical. It is particularly responsible for all aspects of 
his intellectual development. Each and every teacher in the school, whatever 
his specific assignment, shares in this comprehensive responsibility. The 
teacher of physics, for example, is not only a teacher of physics, but is 
also to some degree a teacher of reading, of correctness in speech and writ- 
ing, of mathematics, of the social studies, of the fine arts, and so forth. At 
the same time, he acts in the role of friend and counselor to the pupil, and 
advises him or influences his decisions on many educational, vocational, and 
personal problems. His job is not merely to teach physics, but to help each 
individual boy and girl become a better-educated, more effective, and 
better-adjusted member of society. 

It is, accordingly, extremely important that each teacher acquire a very 
comprehensive acquaintance with each child under his care, that he be at 
all times dependably informed concerning all major aspects of the child's 
development. There is, thus, a very definite and continuous need, in instruc- 
tion, in educational and vocational guidance, and in curriculum evaluation, 
for comprehensive descriptions of the total educational development of 
individual pupils and of groups of pupils — descriptions based as much as 
possible upon dependable and objective measurement. Because of the sub- 
ject organization of the curriculum, however, practically the only objective 
measures on which teachers, counselors, and administrators have in the past 


been able to base any descriptions of the pupils' educational status have 
been scores on subject examinations, for practically no other types of tests
of educational achievement (with the exception of reading tests) have been 
constructed and used, particularly at the high school and college levels. 
Strenuous efforts have been made, especially by guidance workers, to make 
the most of the test information thus collected, particularly through the 
maintenance and use of individual cumulative records of test scores and 
related data. Thus far, however, these efforts have met with only indifferent 
or limited success. The descriptions of individual pupil development which 
it has been possible to derive from these test records have, at best, been 
sadly lacking in comprehensiveness, while the measures obtained have been 
extremely difficult to interpret because of lack of comparability, and have 
been incapable of yielding satisfactory indications of educational growth. 

The descriptions have been lacking in comprehensiveness, in the first 
place, for the very obvious reason that measures of an individual's educa- 
tional status at any given time have been restricted to the areas of instruc- 
tion in which he happened, at that time, to be taking courses. No measure 
of a student's mathematical ability, for example, has been obtained at all 
unless he chanced, at that time, to be enrolled in a course of mathematics; 
no measures of his development in the social studies were obtained in the 
years when he was not taking a course in that area, etc. However, even 
though all such restrictions upon the administration of subject examinations 
were removed, so that at the time of graduation, for example, a student 
might be given a subject examination in each of the subjects he took while 
in high school, the resulting description would still be seriously incomplete 
and unbalanced. This would be true, even apart from the factors noted in 
the preceding section, because of the fact that the recognized ultimate 
objectives of instruction of individual subjects do not collectively constitute 
or account for the recognized ultimate objectives of the whole program of 
general education. Satisfactory adjustment in family and marital relations, 
for example, is generally recognized as an important objective of any pro- 
gram of general education at the high school level. Yet this objective would 
rarely be found listed among the recognized and functioning objectives of 
any individual high school or college subject, and scant attention, if any, is 
given to it in any subject examination. Any general objective of this char- 
acter — for which no particular course has been offered, or for which no 
particular course or series of courses has been made especially responsible — 
tends to be neglected both in instruction and in measurement. In gen- 
eral, complex problems demanding the integrated use of knowledges and 
skills acquired in several courses or areas of instruction practically never 
appear in any subject examinations; the student's abilities to deal with such 


problems are neither measured by the examinations nor adequately de- 
veloped in formal instruction. Incidentally, these are important added 
reasons why batteries of subject examinations were judged inappropriate 
for measuring the total educational development of war veterans (page 
132), or of informally educated or self-educated people in general. 

Past achievement test records (scores on subject examinations) have pro- 
vided only very inadequate indications of individual pupil development, 
furthermore, because of the lack of comparability in the measures obtained. 
Extended consideration will be given later (chapter 17) to the problem of 
rendering comparable measures that have been secured from different tests. 
It may be noted, in advance of this description, that the techniques now
available for making scores comparable depend on the possibility of ad- 
ministering the different tests to the same or to comparable groups of 
individuals. Standardized examinations designed for special school subjects, 
however, have in nearly every case been standardized only for populations 
of pupils currently taking the subject for which the test was designed. This 
means, of course, that with very few exceptions no two subject examinations 
have been standardized for the same or for demonstrably comparable 
groups of individuals. Thus, a percentile norm for an algebra test, based 
upon a sample of ninth-grade pupils, is obviously not comparable to one 
established for a chemistry test on a sample of twelfth-grade pupils; nor 
is a percentile norm for an English test based on a relatively unselected 
sample of ninth-grade pupils comparable to one established for a Latin test
on a different and much more highly selected sample of pupils from the 
same grade. For these reasons, trying to interpret the scores of the pupil 
on a number of different subject tests has been something like trying to 
interpret a number of measures of the same physical object when its various 
dimensions are expressed in different units of measurement, and when little 
or nothing is known of the relationships existing among the various units.
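The incomparability described here can be put in numbers. In the following sketch, all scores are invented for illustration: a pupil of fixed ability on a common scale earns a high percentile rank against an unselected norm group and a low one against a highly selected group, so the two ranks cannot be compared directly.

```python
# Hypothetical illustration: the same raw ability yields very different
# percentile ranks when the norm groups are selected differently.

def percentile_rank(score, norm_group):
    """Percent of the norm group scoring below the given score."""
    below = sum(1 for s in norm_group if s < score)
    return 100.0 * below / len(norm_group)

# Invented ability scores on a common scale.
unselected_ninth_graders = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85]
selected_latin_pupils    = [65, 70, 72, 75, 78, 80, 83, 85, 88, 90]

ability = 75
print(percentile_rank(ability, unselected_ninth_graders))  # 70.0
print(percentile_rank(ability, selected_latin_pupils))     # 30.0
```

The identical ability of 75 appears as the 70th percentile in one group and the 30th in the other, which is the sense in which the two norms are expressed in "different units of measurement."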

In consideration of the lack of comparability and continuity in the meas- 
ures involved, it follows obviously that no series of successive descriptions 
of individual achievement of the type obtained from subject examinations 
will provide any very meaningful indication of an individual's relative
growth in the various aspects of his total development. The measurement 
of growth implies repeated measurement of the same trait, or repeated and 
periodical administration of the same or comparable tests of the same trait — 
something for which very little provision has been made (except in the 
elementary school) in educational achievement testing in the past. Scores 
on successive examinations in arithmetic, algebra, plane geometry, and 
trigonometry, for example, do not indicate how a pupil has improved in 


his ability to do quantitative thinking in general. Nor do scores earned suc- 
cessively on a geography, a world history, a United States history, and an 
economics test (no matter how "comparable" the scores in a technical 
sense) indicate how much he has progressed in his thinking about con- 
temporary social problems. Categories of subject matter are clearly not 
appropriate terms in which to describe educational growth and develop- 
ment. Single score measures of total achievement in individual school sub- 
jects cannot possibly measure well the progressive attainment of general 
educational objectives. Each of these scores is a measure of composite or 
average achievement with reference to a number of different objectives of 
instruction, and each is a different composite from every other. Satisfactory 
descriptions of total educational development can be secured only through 
tests each of which is designed to measure attainment of one general objec- 
tive only, or one homogeneous group of objectives, each of which is con- 
structed independently of the manner in which responsibility for this objec- 
tive or set of objectives is distributed among the various school subjects, 
and each of which is administered periodically to all pupils, regardless of 
their current course registrations. 

The Need for Tests of Hitherto Unmeasured 
Educational Objectives 

If the descriptions of educational development of individual students 
provided by tests are to be truly comprehensive, tests and measuring devices 
must be developed for many more educational objectives than are now being 
measured at all. In general, satisfactory tests have thus far been developed 
only for objectives concerned with the student's intellectual development, 
or with his purely rational behavior. Objectives concerned with his nonrational
behavior, or with his emotional behavior, or objectives concerned
with such things as artistic abilities, artistic and aesthetic values and tastes, 
moral values, attitudes toward social institutions and practices, habits relat- 
ing to personal hygiene and physical fitness, managerial or executive ability, 
etc., have been seriously neglected in educational measurement. 

One of the reasons for the neglect in measurement of many of these 
objectives is that they have likewise been neglected in instruction, although 
either type of neglect may be regarded as a cause as well as a result of the 
other. A more potent reason is that attainment of these objectives is so 
difficult to measure, or that so little is known about how to measure them, 
just as so little is known about how to teach them effectively. (In general, 
the functional validity of tests will never far exceed the functional validity 
of instruction concerned with the same objectives, nor will the validity of 
instruction far exceed that of the tests.) Some of the reasons why these 


objectives are so difficult to measure, as well as the general nature of the 
procedures that may be employed in their measurement, are suggested in 
the latter part of this chapter. Whatever the reasons, the fact and seriousness 
of this neglect are unquestioned. The present need for new tests of hitherto
unmeasured objectives far exceeds the need for further refinements and 
improvements in existing tests. 

Course Examinations in the Content Subjects 

There is a real need in instruction (see pages 124-26) for examina- 
tions based on the immediate objectives and upon the peculiar organization 
and content of specific courses of instruction. While such examinations are 
severely limited in their usefulness so far as evaluation and guidance are 
concerned, they do serve effectively such purposes as the maintenance of 
teaching and learning standards (in assigning grades and determining promotion
and failure), the motivation of students, and the evaluation of
teaching effectiveness in relation to immediate objectives per se. 

A basic premise in the preceding discussions is that the major uses of 
educational tests are in evaluation and guidance, and that, therefore, the 
principal concern of individuals and agencies constructing tests for wide- 
scale use should be with types of tests best suited to these major uses. This 
idea might be carried still further. The proposition is at least debatable that 
the construction of course examinations for the so-called content subjects 
should not be the job of central test construction agencies, that the wide- 
scale use of any specific examination of this type should not be encouraged, 
and that such examinations should not be employed in wide-scale testing 
programs. The corollary of this proposition is that the construction of 
specific course examinations in the content subjects should be primarily, if 
not solely, the responsibility of the individual classroom teacher and of the 
local school system. This proposition is undoubtedly as yet considerably 
premature — teachers are not yet sufficiently well trained in test construc- 
tion to assume this responsibility, but a brief consideration of the reasons 
for the proposition should nevertheless be worth while here. 

One argument for local construction of course examinations in most con- 
tent subjects is that considerable variation from school to school in the 
content of such courses is not only permissible, but often highly desirable. 
This point was considered at length on pages 128-29 preceding, but there 
is still more to be said concerning it. In the social studies in particular, and 
only to a lesser extent in the natural sciences and the humanities, specific 
adaptation of the content and of the immediate objectives of instruction 
to local needs, interests, conditions, and facilities is clearly desirable. Some 


variation in content should also be encouraged to take advantage of differ- 
ences in the training, interests, experiences, and abilities of individual teach- 
ers. High school teachers of literature in a given state, for example, should 
not be required to teach exactly the same literary selections, but should each 
be encouraged to use the particular selections that they individually best 
like and best understand, and through which they can best impart to their
pupils an appreciation of, and a liking for, good literature. In general, in 
all content subjects teachers should be permitted and encouraged, within 
limits, to use that content in which they individually are most interested, 
or with which they have had most personal experience and which they best
understand.

Closely related to the preceding argument is the desirability of a high 
degree of teacher participation in the process of determining immediate 
course objectives and of selecting instructional content. If teachers should be 
encouraged to interpret general educational aims in terms of their own 
immediate objectives, and to experiment with different ways of attaining 
these objectives, clearly they should not be restricted by externally con- 
structed content examinations imposed upon them by their administrative 
superiors. It should be noted, also, that the construction and improvement of 
their own course examinations constitute an unparalleled occasion for teachers
to clarify their own thinking concerning the desired outcomes of instruction.

The direct measurement of ultimate objectives and the minimization of 
immediate objectives are, of course, just as desirable in locally constructed 
course examinations as in examinations intended for wide-scale use in 
evaluation and guidance. It is significant that the local test constructor 
enjoys certain important advantages in this respect over the constructor of 
tests intended for wide-scale use. This advantage may best be made clear 
through a specific illustration. Consider the problem of measuring the 
ability to comprehend literary materials. The constructor of a test intended 
for wide-scale use cannot assume that all of the examinees will be familiar 
with any particular literary selection. He must, therefore, limit his test 
questions to literary selections that are reproduced in the test itself, and 
for practical reasons he is obviously limited to a small number of very short 
selections. He is unable to raise questions calling for comprehension of 
very large units of content, or for comprehension, criticism, and evaluation 
of complete literary works. He is likewise unable to discover what the 
examinees best retain from their recent relatively free reading of complete 
works. The local test constructor, however, knowing that all of his exam- 
inees have recently shared certain reading and classroom experiences, may 


take for granted temporary recall of the salient features of these experiences, 
and is free to raise questions of the type earlier suggested. Similar advan- 
tages in test construction are enjoyed by teachers of the social studies and 
natural sciences, who know that all of their examinees have had in common 
certain specific experiences — experiences far too complex to be reproduced, 
simulated, or described in the test situation — and can base their questions 
on these shared experiences. 

The suggestion that a locally constructed course examination may be 
based upon the peculiar organization and content of the particular course 
involved, however, does not imply that the memorization of such content 
for its own sake is a legitimate end of instruction, or that such examinations 
should hold students responsible for that content for its own sake. The test 
questions may be "based upon" that content in the sense that they take for 
granted temporary recall of specific experiences (content) on the part of all 
examinees, but the questions should reveal differences among the students, 
not in the extent to which they have (temporarily) memorized the content, 
but in the extent to which they have derived the desired lasting outcomes 
from it. Since the recall is to a large extent only temporary, such examina- 
tions may rapidly deteriorate in validity after the students have completed 
the course, but when used during the course of instruction, such examina- 
tions may be definitely superior to any constructed for wide-scale use (see 
pages 128-29). 

To take advantage of the aforementioned opportunities, the typical 
teacher must be very much better trained in the art and technique of test 
construction than he is at present. Until teachers are better trained in this 
respect, a considerable amount of centralized construction of course exam- 
inations for the content subjects seems desirable. In the long run, however, 
leaders in educational measurement should be more concerned with improv- 
ing teacher competence in this respect than with continued production of 
course examinations for wide-scale use. 


The principal points made in the preceding discussion may be briefly 
summarized as follows: 

1. The contribution of educational measurement to education generally 
depends as much or more upon what test constructors elect to measure as 
upon how well they measure whatever they do measure. 

2. Test constructors have, in the past, been too exclusively concerned 
with measuring the attainment of the traditional or prevailing immediate 
objectives of instruction in special school subjects. They have exhibited a 
relatively uncritical attitude toward these objectives, or have too often 


accepted them without question as an adequate and authoritative basis for 
test construction. 

3. Tests intended for wide-scale use, when based upon the most preva- 
lent of the immediate objectives of current instruction, not only fail to 
measure well many of the significant outcomes of instruction, or to provide 
an adequate basis for educational guidance and the evaluation of instruc- 
tion, but tend in themselves to perpetuate doubtful instructional and cur- 
riculum practices. 

4. For purposes of individualization of instruction, guidance, and cur- 
riculum evaluation, much greater emphasis in test construction should be 
placed upon relatively direct measurement of the ultimate objectives of the 
entire educational program. For many of these objectives, no tests of any 
kind or quality are now available. Greater effort must, therefore, be made 
to provide more comprehensive descriptions of the students' general or
total educational development at various levels, through comparable tests 
of such objectives. Such tests or test batteries should be planned and con- 
structed quite independently of the present content and organization of 
school instruction, but in line with trends which are or should be developing 
in content and organization of instruction. 

5. In practice, greater emphasis must be placed upon the periodic ad- 
ministration to all students, without regard to present course registrations, 
of comprehensive batteries of tests of the type just described. It is only in 
this way that comprehensive descriptions of the educational growth of 
individuals may be obtained. 

6. Test constructors generally must assume much more responsibility 
than heretofore for the definition and clarification of ultimate educational 
objectives. Test constructors must exploit much more fully the potential 
values of tests and test construction techniques in the identification and 
clarification of objectives. (See chapter 2.) 

7. The responsibility for the construction of course examinations in the 
so-called content subjects should be increasingly taken over by classroom 
teachers and local systems, and teachers in general should be better trained 
to assume this responsibility. 

Basic Approaches in Educational Measurement 

Direct vs. Indirect Measurement 

Considerable emphasis has been placed in the preceding discussion upon 
the importance of measuring as directly as possible the ultimate objectives 
of instruction. The remainder of this chapter will consider the extent to 
which educational achievement can be directly measured, and will discuss 


certain aspects of what might be termed the major strategy of achievement 
test construction, as contrasted with the more detailed and specific prob- 
lems involved in writing the individual test exercises. 

An educational achievement test may be described as a device or proce- 
dure for assigning numerals (measures) to the individuals in a given group 
indicative of the various degrees to which an educational objective or set 
of objectives has been realized by those individuals. Whether or not an 
educational objective has been realized in any individual can be ascertained 
only through his overt behavior. Indeed, in the last analysis, any educa- 
tional objective is, in general terms, to condition or predispose the indi- 
vidual so that he will behave in a certain way in a certain situation. Accord- 
ingly, a test of any objective may be regarded as consisting in part of a 
situation or series of situations designed to elicit the desired behavior, or 
some other behavior which is presumably related to and will, therefore, 
predict the desired behavior, and in part of a procedure for assigning 
numerals to the properties of the behavior thus elicited, or to the product of 
that behavior. 

The only perfectly valid measure of the attainment of an educational 
objective would be one based on direct observation of the natural behavior 
of the individuals involved, or of the products of that behavior. One of the 
objectives of high school instruction in the social studies, for example, 
might be "to so predispose the student that as an adult he will, at every 
opportunity, exercise the right to vote at elections of important public 
officials or on important public issues." The ultimate and conclusive test of 
the effectiveness of instruction in relation to this objective would require 
a tabulation of the number of times the individual had an opportunity to
vote at such elections, and of the number of times he took advantage of 
this opportunity during his lifetime. This, and only this, would constitute 
a wholly direct, or perfectly valid, measure of the attainment of this objec- 
tive. Numerous examples of what is meant by perfectly valid measurement, 
which are in every instance examples of direct measurement, are given in 
chapter 16 on "Validity," and, to avoid duplication, no further illustrations 
will be given here. 

Direct measurement, then, is that based on a sample from the natural, 
or criterion, behavior series for each individual involved. Indirect measure- 
ment, on the other hand, may be defined as that based on behavior which 
is not a part of, but which is presumably related to, the criterion series. 
For example, one might attempt to measure indirectly the objective referred 
to in the preceding paragraph by noting the proportion of times that each 
high school student takes advantage of his opportunities to vote in school 
elections, on the assumption that those who most frequently exercise the 


right to vote in these situations will, in general, be those who will most 
frequently do so later as adults. 
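The indirect measure proposed here reduces to a simple proportion. A minimal sketch, with invented counts:

```python
# Hypothetical indirect measure from the text: the proportion of school
# elections in which a student voted, taken as a predictor of adult voting.

def voting_index(opportunities, votes_cast):
    """Proportion of voting opportunities actually used."""
    if opportunities == 0:
        return None  # no opportunities observed: no basis for a measure
    return votes_cast / opportunities

print(voting_index(8, 6))  # 0.75
```

The measure is indirect precisely because the behavior observed (school elections) is not part of the criterion series (adult elections), but is only assumed to be related to it.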

It is only rarely possible, and even then not always practicable, to secure 
direct measures of the attainment of an educational objective for students 
yet in school. The reasons for this are fairly obvious, but a brief review 
of these reasons should nevertheless be worth while, particularly since they 
may contain some suggestions for the improvement of indirect measurement.

In the first place, direct measurement of educational achievement is often 
impossible or impracticable because of the ultimate character of the objec- 
tive involved, or because of the delayed appearance of the desired behavior. 
One illustration of this has already been given. While direct measurement 
of the "disposition to vote" objective is theoretically possible, obviously 
school teachers and counselors cannot wait for measures thus obtained, but 
must attempt by whatever indirect measures are available to predict now 
what each individual's criterion behavior may eventually be like. 

A second reason why direct measure is often impracticable is that so often 
the natural behavior series is inaccessible to the examiner, or cannot readily 
be observed by him (for reasons other than the delayed appearance of that 
behavior). For example, one objective of instruction in high school science
might be "to enable the student to make minor repairs of household me- 
chanical and electrical equipment and installations." In this case, a part, 
at least, of the natural behavior series is accessible to the examiner, at least 
so far as time is concerned, although the complete series extends through 
the entire lifetime of the individual. Obviously, however, it is not at all 
convenient or practical for the examiner to make firsthand observations of 
the criterion behavior or to inspect its product. 

A third reason why direct measurement of educational achievement is 
usually impracticable is the infrequency of current occasions for the speci- 
fied behavior. To take an extreme example, the occasions on which a student 
in a life-saving class in swimming has to make use of what he has learned 
are altogether too few to base any useful measure of course achievement 
upon them. In this case, of course, a very close and satisfactory approach 
to direct measurement can be secured in simulated situations, but, unfortunately,
this is not generally the case in the measurement of other types of
objectives.

A fourth obstacle to the direct measurement of educational achievement 
is the lack of comparability in accessible behavior samples for different indi- 
viduals. One of the objectives of elementary school language instruction, 
for example, is to develop in the pupils the habit of spelling correctly the 
words which they will be using in their own writing. This objective implies 


not only that the pupils should spell correctly the words which they are now 
using, but also that as they continue to write more complex and varied 
materials they will spell correctly the new words needed. Here a sample of 
the products of the pupil's natural behavior is certainly immediately acces- 
sible. It would be quite feasible, for example, to collect practically every- 
thing that the pupils write during a given time period, say during one 
school semester, and to make an accurate count of the spelling errors com- 
mitted by each pupil in his own writing. In this case, however, it would 
be extremely difficult to derive from these counts anything that might be 
regarded as comparable measures of spelling ability as defined by the ulti- 
mate objective. One pupil might be a very poor speller, yet he may now 
be attempting only a very limited amount of writing about simple subjects, 
using a very simple and restricted vocabulary, and may consciously or un- 
consciously be avoiding the use of words the spelling of which he is uncer- 
tain. Another pupil, who may be a very superior speller, may attempt to 
write at length about subjects demanding a very extensive and difficult 
vocabulary, and hence may create many more opportunities to misspell 
words than the first pupil. The number of spelling errors, even though 
expressed as a proportion of the number of running words or of the number 
of different words used, would hardly reveal the true differences in spelling 
ability in these pupils. 
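The spelling example can be put in numbers. In the following sketch (all counts invented), the truly superior speller shows the *higher* error rate simply because he attempts far more, and far harder, words, so the raw rate inverts the true ordering of ability.

```python
# Why error rates in free writing mislead: pupils choose writing tasks of
# very different difficulty, so the denominator is not comparable.

def error_rate(errors, running_words):
    """Spelling errors as a proportion of running words written."""
    return errors / running_words

# Pupil A (truly poor speller): writes little, avoids risky words.
poor_speller_rate = error_rate(3, 500)        # 0.006
# Pupil B (truly superior speller): writes much, attempts hard words.
superior_speller_rate = error_rate(10, 1000)  # 0.01

print(poor_speller_rate < superior_speller_rate)  # True: ordering inverted
```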

A fifth reason why direct measurement of educational achievement is 
seldom attempted is that it is often so costly in time and effort, or so
inefficient, and hence may not be practicable even though it is otherwise 
possible. The foregoing example provides an excellent illustration of the 
inefficiency of direct measurement. To secure a count of spelling errors in 
the pupils' own writing, it would be necessary, in order to attain a minimum 
reliability of measurement, to examine thousands of words of writing for 
each pupil. Yet among the thousands of words examined, only an extremely 
small proportion would constitute spelling problems at all, and the great 
majority of the words inspected would contribute nothing at all to the aim 
of discriminating between good and poor spellers. Furthermore, the factor 
of illegibility of the pupils' writing would often be confused with that of 
actual misspelling and would serve, furthermore, to increase the time and 
effort required to make an accurate error count. The use of a list dictation 
test, in which the pupils write only words that are known to discriminate 
sharply between good and poor spellers, would obviously result in much 
more efficient and economical measurement. It is a fairly safe generalization 
that most "natural behavior series" consist in very large part of elements 
that are of zero or near-zero difficulty, or that are otherwise nondiscriminat- 
ing for measurement purposes, and that relevant elements in the criterion 


series are usually associated with many irrelevant, distracting, or misleading 
elements that tend seriously to lower the validity of the observations made 
for purposes of measurement. 

A sixth obstacle to direct measurement of educational achievement, 
closely related to that just considered, is the relative complexity of most 
criterion behavior series, and the difficulty of analyzing out, or of isolating 
for observation and measurement, those elements of the total complex that 
are relevant to a given measurement purpose. In the natural behavior situa- 
tions demanding arithmetic reasoning on the part of a ninth-grade pupil, 
for example, the reasons for his failure to arrive at a correct solution to a 
problem may often have little to do with his arithmetic ability as such, but 
may be the result of distracting influences, of emotional factors, and so 
forth, which may be extremely difficult to identify and to separate from the 
factors involved in arithmetic reasoning per se. How much more difficult it 
would be to identify in the natural behavior of the adult those elements 
of "good citizenship" which are attributable to a particular course of school 
instruction is too obvious to warrant discussion. 

The Four Basic Test Types 

The implications of the preceding discussion are fairly clear. Instead of 
observing the examinee's behavior in a sample of the situations which 
present themselves to the examinee in the natural course of events, the 
examiner must in most cases* present to the examinee a number of situa- 
tions especially selected or designed to elicit such behavior. It should be 
noted that if it were possible to observe the entire criterion series for any 
two individuals, those series would almost certainly differ markedly for 
those individuals. Accordingly, the selected test situations, which are the 
same for all examinees, may not strictly be regarded as a random or repre- 
sentative sample from the criterion series for any examinee. Furthermore, 
for the sake of expediency, efficiency, and comparability, it is necessary to 
eliminate from, or to control in, the test situations many irrelevant factors 
or nondiscriminating elements which would operate or be present in the 
criterion situations. Because of these restrictions and controls, the test 
situations become relatively artificial in character, and hence the validity of 
the test may never be taken entirely for granted. The problem for the test 
constructor, then, is to devise a test series that will be as much like, or as 
closely related to, the criterion series for all examinees as considerations of 
expediency, efficiency, and comparability will permit. 

" Limited use is made, in educational measurement, of observation check lists, question- 
naires regarding previous behavior, anecdotal records, and records of specific activities such 
as reading records, and the like. 


Accordingly, the most basic classification of tests into types is that which 
depends upon the nature of the relationship of the behavior comprising the 
test series to that constituting the criterion series. Four such basic types of 
tests may be identified. It may be noted, first, that all educational objectives 
are ultimately concerned with behavior, that is, with what the examinee 
can or will do in specified situations. Accordingly, the test constructor may 
(1) give the examinee special occasion to do some of the things that are 
specified by the objective (without waiting to observe those things in the 
natural course of events), and assign measures on the basis of the frequency 
or adequacy with which he does those things; (2) give the examinee occa- 
sion to do things similar to some of those specified by the objective, and 
assign measures on the basis of the assumed relationship between the 
behavior elicited by the test and that constituting the criterion series; (3) 
describe the situation in which the examinee would have occasion to do 
what the objective specifies, and then ask him to tell what he would do 
in this situation or how he would do it; and (4) discover whether or not 
the examinee knows the facts, rules, principles, etc., that are presumably 
essential or conducive to the desired behavior. 

There are relatively few instances of actual tests that are purely and 
solely of one of these types only. Often the same test may exhibit some of 
the characteristics of each of these four types. There is certainly no impli- 
cation that tests should be purely of one type or the other, nor is it suggested 
that any effort should be made to classify existing tests on this basis. The 
distinctions are here made primarily for the purposes of this discussion, in 
an attempt to develop a clearer understanding of certain basic problems in 
test construction. For these purposes it should be worth while to illustrate 
and consider briefly some of the more important characteristics of each of 
these types. 

The "Identical Elements" Type of Test 

The first of these four basic types may, for want of a better name, be 
called the "identical elements" type. The most important characteristics of 
this type are that certain behavior situations are presented to the examinee 
for the special purpose of measurement, and that the elements of behavior 
elicited by the test situations are practically identical with certain of the 
elements comprising the criterion series for the individuals involved. The 
test series, however, does not contain all of the elements comprising the 
criterion series; rather, only the more discriminating, or the more readily 
reproducible, or the more crucial, or the more readily measurable, or the 
more relevant of the elements of the criterion series are selected for the 


test, and these elements may, in important respects, be quite differently dis- 
tributed than in the natural or criterion situations. 

A good example of a test that meets this description is the type of ste- 
nography test that is frequently used in business schools. The test content 
may consist of a specially prepared business letter which is in many respects 
representative of the letters that the student may later have occasion to 
type as an office stenographer and typist. This letter is dictated to the class 
of students as a group, and the students take the letter in shorthand and 
then prepare a typed copy of it from their own stenographic notes. The 
letter is scored by assigning predetermined weights to such things as the 
number of typing errors made in predetermined error situations, the ar- 
rangement of the letter on the page, the neatness and uniformity of the 
carbon copies, and the time required to type the letter. The letter may be 
far from typical of actual business letters in that, for the sake of efficiency 
of measurement, it may be much more heavily loaded with opportunities 
for certain types of error than a typical letter would be, and many elements 
frequently found in business letters but nondiscriminating for measurement 
purposes may be entirely omitted from the test letter; yet the elements that 
are present in the test letter are, for all practical purposes, identical with 
those found in letters typed in the "natural behavior" situations. The test 
situation is, of course, a somewhat artificial one; many of the attendant 
circumstances may be quite different from those associated with the natural 
behavior series, but for the purposes at hand these differences or artificialities 
may be of no practical consequence. 

Most so-called "work sample" or "performance" tests are predominantly 
of the identical elements type. For example, the performance tests some- 
times given to applicants for an automobile driver's license, or those given 
in shop courses in industrial and domestic arts, clearly fall in this category. 
(See pages 456-63 for further illustrations.) While these tests often con- 
tain irrelevant elements which may bias the measures obtained, nearly all 
of the things the examinees are required to do are practically identical with 
those that they would be expected to do in the natural situation. 

A particularly important and widely used test that is primarily of this 
type is the test of reading comprehension. Such tests typically consist of a 
collection of reading passages, much like those which the student would 
later have occasion to read and interpret in the real life situation, and which 
are accompanied by a series of questions requiring the student to derive 
meanings and draw inferences like those that he might have occasion to 
derive in his own free reading. As actually constructed, such tests are often 
of low validity due to poor selection of reading passages and to failure to 


devise questions that will require the students to exercise some of the more 
important components of general reading ability. In a well-constructed 
reading test, however, most of the things the student is required to do are 
essentially the same as those he does in the natural situation. In other words, 
the best reading test in itself constitutes the best available definition of 
what is meant by reading ability. 

Incidentally, it is in tests of this general type, particularly in tests of 
ability to interpret, to evaluate, and to think critically about complex read- 
ing materials, that the art of educational achievement test construction has 
perhaps reached its highest level of development to date. 

The "Related Behavior" Type of Test 

The second basic type of test is that in which the elements of the test 
series are not a part of the criterion series, but are presumably substantially 
related to the criterion behavior for the population involved. The elements 
of the test series may or may not be similar to those in the criterion series, 
the essential condition being that measures based on the two series be highly 
correlated. In educational achievement testing, however, for reasons later 
to be considered, the aim is usually to secure and insure the desired corre- 
lation by making the test series as much like the criterion series as possible. 

In actual practice, it is impossible to draw any sharp line of distinction 
between the identical elements and the related behavior types of tests. 
Actually, some elements of the test series are usually identical with those 
in the criterion series, but many of the elements in the test series are not 
present at all in the criterion series, the proportion of such elements varying 
considerably from test to test. 

One example of a test of this type has already been suggested — that in 
which the adult's voting behavior is predicted by his voting behavior in the 
student government situation. A test in which the general trait of "honesty" 
is measured by determining how frequently students cheat in a specially 
devised examination is another example of this type. Most so-called "simu- 
lated behavior" tests belong to this category, such as tests of the type used 
by the military services in personnel selection in the last war in which, for 
example, the examinee might sit in a mock cockpit behind a mock machine 
gun and "fire" at images of pursuit planes on a moving picture screen. 

The varying proportion of identical elements that may be found in tests 
of this type is readily illustrated in the measurement of spelling ability. 
As has already been noted, measures of spelling ability might be derived 
from error counts based upon the free writing of the students. Such a test 
would be almost purely of the identical elements type. To control the 
sampling of spelling opportunities from pupil to pupil, however, the device 


might be employed of dictating the same sentences to all pupils, and count- 
ing the spelling errors in these sentences. In this case, the score of any pupil 
may depend, in addition to his spelling ability, upon such things as how 
clearly the examiner dictated the sentences, how accurate is the hearing and 
how close the attention of the examinee, how rapidly the sentences were 
dictated, how many clues to the correct syllabification and spelling are pro- 
vided by the articulation or pronunciation of the words dictated, how ade- 
quately the context distinguishes some of the words from their homonyms, 
etc. Greater efficiency may be obtained by dictating a list containing only 
words known to be difficult or discriminating, but only at the cost of increas- 
ing the artificiality of the test situation. It is quite conceivable, for instance, 
that in a test situation of this type, where the pupil's attention is sharply 
focused on the problem of spelling particular words, and where he is helped 
by hearing the words pronounced, his spelling of some words will differ 
from that which he would habitually and unconsciously employ in his own 
free writing. In the list dictation test also the manner in which the words 
are dictated may again introduce irrelevant variations in the test scores. 

Still greater efficiency may be secured, and some of these irrelevant factors 
may be controlled, by providing the pupils with printed sentences or lists 
containing misspelled words, with instructions to write the correct spelling 
above each misspelled word. From this, it is but a short step to the type 
of test in which the examinee is presented with a number of groups of 
printed words, and asked merely to identify the misspelled word in each 
group, without having to provide the correct spelling. In tests of this type, 
the relative frequency with which the examinee detects and corrects, or 
only detects, spelling errors in printed sentences or lists prepared by 
others, and the relative frequency with which he would habitually and 
automatically spell the same words correctly in his own writing, may be 
far from perfectly correlated. 

The foregoing example illustrates the general truth that increased effi- 
ciency and economy in test administration and scoring and increased con- 
trol of irrelevant factors are usually secured by introducing greater and 
greater differences between the behavior actually tested and that constitut- 
ing the criterion series. In order to control irrelevant variations of some 
types, other irrelevant factors must be deliberately introduced. Greater 
comparability and greater reliability per unit of time are thus achieved, 
but frequently at a definite sacrifice in intrinsic validity. 

The "Verbalized Behavior" Type of Test 

The third basic test type is that in which selected behavior situations 
from the criterion series are described to the examinee, and in which he 


tells how he would (or how one ought to) behave in those situations. The 
description or presentation may be either verbal (oral or written) or visual 
(pictures, charts, diagrams, or movies). The examinee may be required 
to tell what he would do, or how he would do it, in his own words (writ- 
ten essay or oral), or he may select the one of a number of suggested 
descriptions of possible behavior which he considers correct or best (mul- 
tiple-choice test). This type of test is really a variation of the related 
behavior type, but is sufficiently important to deserve separate consideration. 

An example of this type of test would be that in which teacher candi- 
dates are presented with a series of detailed descriptions of classroom 
situations in which problems of classroom management have arisen, in 
which each situation described is accompanied by descriptions of various 
actions that the teacher might take in handling the situation, and in 
which the examinees are to select the action that in their judgment is 
best for each situation. 

Another test of this type would be that in which medical students are 
presented with a series of descriptions, possibly supplemented by photo- 
graphs, X-rays, etc., describing the symptoms of a number of patients, 
each description being accompanied by several suggested diagnoses and 
several suggested therapies, the task for the examinees being to select the 
correct diagnosis and correct therapy for each case. 

Still another example of this type would be a test in which students in 
a high school course in Family and Marriage are presented with descrip- 
tions of a number of domestic problem situations, each of which is 
accompanied by a description of several solutions to the problem which 
might be adopted by the persons involved, the examinees being required 
to select the description that they consider represents the best solution 
in each case. 

The foregoing examples have all been of tests of the objective type; the 
manner in which the same idea can be utilized in tests of the essay type 
is obvious. 

For a great many educational objectives, the criterion behavior series is 
of such a character that it is utterly impractical to attempt to reproduce 
or to simulate in the school examination the overt behavior with which 
the objective is ultimately concerned. In many such situations, the verbal- 
ized behavior type of test may serve as a very acceptable and practicable 
substitute for the overt behavior type, and might do so even though the 
latter type could be employed. In the social studies, particularly, this is a 
very promising test type, and one whose potentialities test constructors 
have only begun to exploit. 


The Test of the Student's Knowledge 

The fourth basic type of test is that which seeks to discover how much 
the student knows about a particular topic or subject. It has been perhaps 
the most widely used of the various test types and is too well known to 
require specific illustration here. The majority of school examinations or 
standardized tests in most of the so-called content subjects have been 
almost exclusively of this type. 

The limitations and possibilities of this type of test have already been 
considered at length in the earlier part of this chapter and ways of improv- 
ing tests of the knowledge type generally will be extensively considered 
later in chapter 7, "Writing the Test Item." It will be sufficient for the 
present purposes, then, only to remind the reader of a few of the most 
important characteristics of tests of this type. 

In the first place, in evaluating tests of this type, it is especially impor- 
tant to distinguish clearly between tests of "knowledge" in the sense of 
understanding or of acquiring meanings, and tests of knowledge in the 
sense of verbalizations that have been learned by rote and are practically 
devoid of meaning and functional value to the learner. The acquisition of 
meaning is an active process, involving the drawing of inferences, the 
relating of items, the translation of statements into one's own words, the 
formulation of generalizations, the finding of illustrations and applications, 
and so forth. At the one extreme, a test of the student's knowledge may 
require the student to do these things, and may be primarily concerned 
with relatively broad concepts, basic principles, fundamental generaliza- 
tions and relationships. At the other extreme, a test of the student's 
knowledge may be concerned only with poorly selected, detailed, descriptive 
facts and may require nothing more from the examinee than the recall 
or recognition of verbal stereotypes which have been memorized but not 
understood. Unfortunately, a large proportion of educational achievement 
tests have approached closer to the latter extreme than to the former. 

There is, of course, very little justification for tests of the latter type 
in educational achievement testing. The justification for tests of the stu- 
dent's knowledge in the more desirable sense rests upon the contention 
that extensive knowledge is essential or conducive to the overt behavior 
with which the ultimate educational objective is concerned. In other words, 
the final justification for this type of test is that there is high correlation 
between how much or what an individual knows and how he will behave 
in certain situations. (Thus, the knowledge type of test may also be re- 
garded as a variation of the related behavior type.) This justification 
undoubtedly has some validity, and tests of the student's knowledge in the 
more desirable sense will undoubtedly always play an important role in 


educational achievement testing. However, while extensive and accurate 
knowledge may be an essential condition to effective thinking and to 
desirable overt behavior in certain situations, it is by no means a sufficient 
condition. The correlation between test behavior and criterion behavior for 
this type of test is therefore by no means perfect, and is particularly likely 
to be quite low where acquisition of knowledge for its own sake has 
become an end of instruction, as it so frequently has in practice. Tests 
which measure how well the student can make use of what he knows (and 
which thus indirectly measure also how much he knows) are therefore 
much to be preferred to tests of knowledge alone. 

The Fundamental Goal in Achievement Test Construction 

It has been shown that, for various reasons, direct measurement of 
ultimate attainment of educational objectives is, in most instances, impos- 
sible or impracticable in the school situation. At best, the educational test 
constructor can elicit in the test situation only a limited sample of the 
behavior elements constituting the complete criterion series. Usually, he 
must be content to substitute for many elements of the criterion series 
definitely dissimilar (but presumably related) elements of behavior that 
are immediately and readily accessible to him. Nevertheless, it should 
always be the fundamental goal of the achievement test constructor to make 
the elements of his test series as nearly equivalent to, or as much like, the 
elements of the criterion series as considerations of efficiency, comparability, 
economy, and expediency will permit. The more nearly the test itself com- 
pletely defines the ultimate educational objective involved, the more satis- 
factorily the test will serve its many purposes. The aim of the test con- 
structor is thus always to make his test as much of the identical elements 
type as he possibly can, and to resort to the use of the other types only 
when no other procedure is at all practicable. 

It may sometimes be possible, with a test series that differs considerably 
in character from the criterion series, to demonstrate quite conclusively 
that a high relationship exists between measures based on the two series 
for the population for which the test is intended. It is very important to 
observe, however, that this in itself does not constitute adequate justifica- 
tion for the widespread and continued use of the particular test involved. 
This is because the widespread and continued use of a test of the char- 
acter just described will, in itself, tend to reduce the correlation between 
the test series and the criterion series for the population involved. Because 
of the nature and potency of the rewards and penalties associated in actual 
practice with high and low achievement test scores of students, the be- 


havior measured by a widely used test tends in itself to become the real 
objective of instruction, to the neglect of the (different) behavior with 
which the ultimate objective is concerned. The students attain greater 
proficiency in doing what the test requires, without improving themselves 
in the criterion behavior, and the correlation between the two is lowered. 
Thus, the test may not only exercise an undesirable effect upon instruc- 
tion and learning, but the validity of the test itself deteriorates more and 
more as the test is more widely used. This may occur not only with tests 
of the related behavior type, but with tests of the identical elements type 
as well. In the latter type of test, not all of the elements of the criterion 
situations are included in the test series, and those not included tend to be 
neglected in instruction, or learning, or both. The effect, of course, is most 
serious with tests of the related behavior type. Thus, tests concerned only 
with verbalized behavior are likely to encourage undue emphasis upon 
verbalization in learning, or tests concerned only with the extent of the 
student's knowledge are likely to cause neglect in instruction of the func- 
tional values of that knowledge to the learner. 

The practice of cramming for educational tests, although perhaps not 
a very serious problem generally, is of interest here because it serves to 
bring into sharper focus the problem just raised. Cramming for tests is 
really undesirable only to the extent that it fails to promote, or actually 
interferes with, the attainment of the ultimate objective with which the 
test is concerned. If the test encourages intensive but temporary memoriza- 
tion of ill-digested facts or meaningless verbalizations, for example, then, 
of course, the result is bad. If, on the other hand, the test encourages the 
student to do what he ought to do in any event, that is, if the test series 
is identical with the criterion series, then cramming is all to the good, and 
the more of it the better. Interestingly enough, the more nearly the test 
itself adequately defines the educational objective involved, the less does 
cramming for the test tend to be practiced at all. For example, there is 
never much of a problem of cramming for a good reading comprehension 
test. Students and teachers soon learn that the only way to secure high 
scores on such tests is really to improve in reading ability, something which 
they readily acknowledge that they do not know how to accomplish in the 
few hours preceding the examination. 

The principal danger in achievement test construction, then, is that the 
behavior comprising the test series may not be sufficiently representative 
of the entire criterion series. As has already been noted, this is true of the 
identical elements type of test as well as of the others. In constructing tests 
of the identical elements type, the tendency is to limit the test series to the 


elements of the criterion series that are most conveniently and most easily 
reproduced, or most easily and objectively observed and evaluated, and 
many of the more unmanageable but more important and crucial elements 
tend to be neglected in, or omitted from, the test. Similarly, in construct- 
ing tests of the related behavior type, the tendency is to simulate or to 
reproduce verbally only those elements of the criterion series which can be 
most easily simulated or reproduced. 

Complex vs. Simple Tests and Test Exercises 

There is one element that is common to a great many criterion series, 
which tends particularly thus to be neglected or ignored in the construc- 
tion of achievement tests. Attention has previously been drawn to the 
great complexity that frequently characterizes the behavior comprising the 
criterion series for an ultimate educational objective. This complexity may 
in itself constitute the very essence of the criterion behavior, or may be 
regarded as the most important single aspect of, or element in, that be- 
havior. It is rather common practice in achievement test construction to 
attempt to analyze a complex criterion behavior into relatively pure or 
simple elements or traits, to attempt to measure each of these elements 
separately or independently, and then to combine these measures into a 
single composite score for the purpose of the action judgment which must 
eventually be made. This tendency has possibly been encouraged by the 
emphasis placed upon analytical or factorial analysis procedures in apti- 
tude and psychological testing. In certain respects, this influence may have 
been an unfortunate one. Eventually, test theory and technique may advance 
to the stage where a composite measure thus obtained will accurately de- 
scribe or evaluate the complex behavior of the criterion series, but, for the 
present, it seems best to attempt to incorporate in the achievement test 
situation as much as possible of the same complexity that characterizes 
the criterion situation. This is particularly true in tests of critical thinking 
or tests of the ability to interpret and evaluate complex materials in the 
social studies, the natural sciences, literature, etc. In such tests the most 
important consideration is that the test questions require the examinee to 
do the same things, however complex, that he is required to do in the 
criterion situations, even though it may consequently be very difficult to 
classify these questions into clear-cut homogeneous categories, or to estab- 
lish meaningful part scores for such categories. In building a reading com- 
prehension test, for example, one can be too much concerned with the 
attempt to have certain items measure only the "ability to note details," 
others measure only the "ability to organize ideas," and still others only 
the "ability to infer meanings of words from context," to the extent that 


he may fail to employ more valid questions which have occurred to him, 
but which he excludes because they are too complex in character to be 
readily classed in categories of the type suggested. As a result, the essence 
of the complex behavior may be analyzed out, because the sum of the 
parts may not be equal to the whole. 

Closely related to the problem of reproducing in the test series the es- 
sential complexity of the criterion series is that of reproducing or simulat- 
ing in the test series the full scope and variety of the criterion series, or of 
including in the test all of the relevant components of the natural behavior 
situation in their proper relation to one another. The tendency is greatly 
to oversimplify the test situation and to exclude from the examinee's con- 
sideration many specific elements that might prove crucial in determining 
his behavior in the corresponding natural situation. In tests of the verbal- 
ized behavior type, in particular, the tendency has been to make the de- 
scriptions provided altogether too short to accomplish their purpose well. 
In actual classroom management situations, for example, the action taken 
by the teacher would often depend upon a very intimate acquaintance with 
the personalities and past performances of all of the individual pupils in 
the group, and only a very skillful examiner could succeed, even in several 
pages of description, in presenting many of the more subtle factors that 
would influence the teacher's judgment in the real situation. Likewise, it 
would be extremely difficult to anticipate in the medical diagnosis test all 
of the characteristics of the patients which might be regarded by some 
examinees as symptoms of a particular disorder, and on which their diag- 
nosis might hinge. Frequently, in the real situation, one of the major 
difficulties involved in arriving at a sound judgment is that of identifying 
the elements in the total situation that are truly relevant to the particular 
problem in hand. The ability to distinguish relevant from irrelevant factors 
is thus an important part of the total ability to be measured, and the 
situation described in the test must therefore contain many elements which 
are irrelevant to the solution of the problem presented to the examinee, 
but highly relevant to the purposes of the test. 

Still another aspect of the need for more complex, as well as for simpler, 
tests and test exercises deserves specific mention. Great emphasis has been 
placed, in discussions of the curriculum and of methods of instruction, on 
the need for closer coordination or integration of instruction, and on the 
desirability of giving students more occasion to use together, in an inte- 
grated attack on complex problems, the many skills, knowledges, and 
abilities which they have acquired in different school subjects and at dif- 
ferent times. Tests that may be used to evaluate the extent to which instruc- 
tion has been effectively integrated, and that will place an effective premium 


upon such integration, are as sorely needed as integration of the content 
and methods of instruction itself. Obviously, only tests of a highly
complex character can adequately fill this need. 

In connection with this discussion of test complexity, it may be noted 
that for many practical purposes in education, too many test scores may 
be worse than too few. It would be quite possible to analyze a high school 
student's total educational development into, say, a hundred specific ele- 
ments or aspects. Few teachers or counselors, however, would be capable 
of interpreting such a mass of test data, particularly in view of the prob- 
able lack of comparability in the many measures presented. In general, 
such data, however numerous, are used as the bases for a very small num- 
ber of critical action judgments. Since any action judgment must in any 
event be based upon no more than a few composites, it might very often 
be better to use as those composites the scores on a few realistic and com- 
plex tests, rather than the weighted combinations of scores on a very 
much larger number of simpler and more homogeneous subtests which 
do not adequately define the whole of what the tests are intended to measure.

The foregoing is certainly not meant to imply that there is no place for 
diagnostic or analytical testing in educational measurement. On the con- 
trary, the description of an individual provided by test scores should 
always be as analytical a description as can be advantageously used for 
the practical purposes in hand. A reading test intended for purposes of 
general educational guidance, for example, might best be of a complex 
character, while one intended to serve as a basis for diagnosis leading to 
remedial reading instruction with problem children may have to consist 
of quite a number of relatively homogeneous parts. Even here, however, 
teachers must be cautioned not to assume that the desired whole will 
necessarily result automatically from isolated attacks on the parts identified 
by the test. 

The Limitations of Written Examinations 

Reference has been made earlier to the fact that many of the more 
important objectives of the whole program of general education have thus 
far received little or none of the attention of measurement workers. This 
fact is closely related to the extremely heavy reliance that has been placed 
upon pencil-and-paper techniques of measurement in education. Written 
examinations, in general, are well adapted only to the measurement of 
the intellectual aspects of the student's educational development, or of his 
rational behavior. The verbalized behavior type of test, for instance, may 
often serve as a fairly satisfactory substitute for direct measurement of the 


overt behavior specified by an ultimate objective, but only if that behavior 
is almost completely determined by logical considerations. For example, a 
well-constructed medical diagnosis test (see page 150) might work quite 
well in this regard, but even the most skillfully constructed test for a 
course in Family and Marriage (see page 150) might fail to predict well 
the actual behavior of the examinees in the corresponding natural situa- 
tions. In the examination situation, the student may profess certain atti- 
tudes or beliefs, or indicate that he would act in a certain fashion, simply 
because he knows that certain responses are socially approved, and that 
only such responses will be given credit in scoring the test. 

For many objectives of the type suggested, measuring or observational 
devices concerned with the student's overt behavior while he is yet in 
school, even though of a rough and opportunistic character, are perhaps 
much more worth while than any written types of tests. The actual be- 
havior of the student in school elections, for example, however fragmentary 
or limited in sampling, may provide a better clue to his future behavior 
than anything he professes that he will do in a written test; books actually 
read by the pupil in his free time may constitute a truer indication of his 
literary tastes than his score on a literary appreciation test; anecdotal 
records may be more meaningful than scores on personality tests; etc. The 
problem of deriving comparable measures from such opportunistic observa- 
tional data now seems to present almost insuperable difficulties, but per-
haps none worse than others that have been resolved through determined
and persistent effort.

This volume — especially that part of it concerned with test construction 
— deals almost exclusively with written examinations. This is because it is 
only with such examinations that measurement workers in education have 
acquired any large body of experience and knowledge that can be handed 
on in meaningful form to the beginning student. Many of the principles 
of test construction, however, and most of the theory of measurement 
herein presented, are generally applicable to any and all types of tests and 
measuring devices. It is much to be hoped that, in the future, educational 
measurement workers will apply these principles and suggestions to the 
many areas now so sorely in need of attention. 


Most of what has been said in this chapter represents expression of 
opinion only. Few if any of the statements made could be closely docu- 
mented or conclusively substantiated by concrete experimental evidence. 
Some of these statements would be subscribed to by practically all leaders
in this field; on others there might be sharp disagreement. The one thing 


on which all certainly would agree is that the questions here considered 
represent the kinds of questions to which measurement workers generally 
should devote a much greater share of their attention. If measurement is 
to continue to play an increasingly important role in education, measure- 
ment workers must be much more than technicians. Unless their efforts 
are directed by a sound educational philosophy, unless they accept and 
welcome a greater share of responsibility for the selection and clarification 
of educational objectives, unless they show much more concern with what 
they measure as well as with how they measure it, much of their work 
will prove futile or ineffective. 

The author's primary purpose in this chapter, therefore, has not been 
so much to suggest answers to the questions raised as simply to raise the 
questions, or to indicate the general direction of the thinking which should 
precede the selection of any specific test construction project. If the an- 
swers that have been suggested may at times have seemed dogmatic in 
character, this may itself contribute to the central purpose of stimulating 
more critical consideration of the questions raised. 

6. Planning the Objective Test 

By K. W. Vaughn 

Formerly with Cooperative Test Service 

Collaborators: Dorothy C. Adkins, University of North Carolina;
Louise Witmer Cureton, University of Tennessee; Frederick B. Davis,
Hunter College; Geraldine Spaulding, Educational Records Bureau

Planning is an essential activity in all stages of a test 
construction project. Inattention to planning may result in a failure to meet 
production deadlines, or may necessitate the use of uneconomical pro- 
cedures or of below-standard materials in order to meet those deadlines. 
It may mean that certain desirable procedures will prove unusable in the 
later stages of the project because inadequate foundation was laid for 
them. It may result in the wasteful preparation of more items than are 
needed in the finished test, or in a failure to prepare enough materials to 
survive the tryout. In general, it may lead to countless annoyances and 
delays due to a failure to coordinate properly the various phases of test construction.

Test planning encompasses all of the many and varied operations that 
go into producing a test. Not only does it involve the preparation of an 
outline or table specifying the content or operations to be covered by the 
test, but it must also involve careful attention to item difficulty, to types
of items, to directions to the examiner, to arrangements for tryout, to prob- 
lems of test reproduction, to provision for expert review, to the provision 
of adequate equipment and facilities, to the procurement of personnel, 
and so forth. These are only a sample of the many operations in, and 
aspects of, test construction that demand planning. This chapter thus 
provides a brief orientation to a number of problems that are more 
thoroughly discussed in later chapters. The major concern of the chapter, 
however, is not with the details of how the various difficulties involved 
in test construction are to be met, but rather with the need for anticipating
these difficulties before they arise, for coordinating and tying together the
various operations involved, and for insuring smooth and efficient adminis-
tration of the project as a whole. 



Defining the Purpose of the Test 

Most test construction projects are originally undertaken with only a 
general or somewhat indefinite conception of their purpose in the minds 
of the test constructors. Perhaps the most basic step, therefore, is to 
analyze and clarify these general objectives so that the purposes of the 
test can be stated in specific, concrete terms. 

For some types of tests (called "predictor" tests in chapter 9), satis- 
factory criterion data may make possible the computation of meaningful 
validity coefficients. In such a case the general statement of the purpose 
of the test might be to predict a particular criterion, and the final check 
on the test would be its relationship to this criterion. For most achievement 
tests, however, meaningful validity coefficients cannot be obtained. The 
validity of a test of this kind can be estimated only by subjective judg- 
ments regarding the extent to which it measures what it is intended to measure.

For either type of test, there should be available a general statement of 
the test purpose. Such a general definition of objectives should specify 
what is to be measured, who is to be tested, and what uses are to be 
made of the test scores. The purpose of a French test, for example, might 
be stated as follows: 

French, as defined for the purposes of this test, consists of knowledge of 
French words, ability to translate French prose, knowledge of French 
grammar, and information about France and French culture. This test is
to be administered to public school pupils throughout the United States 
who are completing two to six semesters of French and who are in grades 
nine, ten, eleven, and twelve. The resulting scores are to be used as one 
factor in assigning grades. 

So general a statement, of course, serves merely as a starting point, and 
must be broken down into much greater detail to provide a meaningful
guide to the item writer. 

In the case of predictor tests for which a criterion measure is available, 
use of a carefully prepared outline of test content is the best guarantee 
that all measurable relevant abilities will be covered. Before the items are 
actually correlated with the criterion, the judgment of the examiner must 
be applied in attempting to secure adequate coverage. The statistical anal- 
ysis serves as a check on his judgment. For self-defining achievement tests, 
a fairly comprehensive outline of test content is perhaps even more essential.

Sometimes the statement of purpose for a test leads at once to the 
formulation of a test outline; at other times considerable labor is involved. 
Particularly for educational achievement tests of the self-defining type,


the detailed definition of test purpose may depend upon the analysis of 
behavior, of jobs, of textbooks, or of curriculums, and may entail time- 
consuming consultation with subject-matter specialists. In some cases the 
definition of purpose may be confined to a fairly informal sketch of topics 
to be covered, particularly when the purpose is very narrow, or when it 
is prepared for highly competent test constructors. The labor in preparing 
the test outline varies with the nature and purpose of the test and with 
who is to build it. 

Preparing an Outline of Test Content 

Two Aspects of Test Content 

The term "test content" has been used rather broadly to cover both the 
subject matter of the test or individual test item and the type of ability
that it is thought to require. 

Sometimes these two aspects of test content can conveniently be treated 
together. For example, if "ability to compute the arithmetic mean from 
grouped data" is listed as a test topic, both the subject matter and type 
of behavior to be tested are at once clear. 

On the other hand, often the specific content might well be separated 
from the type of behavior to be tested and attention be directed specifically 
to each aspect separately. Too often a test outline is confined to subject 
matter alone, and the type of behavior to be tested is left to the judgment 
of an inexperienced item writer. One of the later illustrations shows a 
convenient way of segregating both aspects of test content in a two-way 
table (page 162). Thus, a test outline can be used to show clearly not 
only what different areas of subject matter are to be covered but also the 
types of behavior to be elicited with respect to each area. As will be seen, 
it can also indicate the relative emphases of the various topics and of the 
several types of behavior. Moreover, it can reveal the relative emphases on 
different types of behavior for each subject-matter area, as well as the 
relative emphases on different topics for each type of behavior.

Ways of Determining Test Content 

Analysis of objectives of measurement 

Since the fundamental objectives of education are ultimately concerned 
with the modification of human behavior, it is advisable to analyze the
objectives of instruction to determine what activities and skills should be
appraised in an educational test. The practical difficulties that often pre- 
vent satisfactory measurement of the ultimate objectives of education are 



described in chapter 5 and will not be repeated here. All objectives pref-
erably should be included in the test outline, and explicit note should be
made of any objective that cannot be or is not being measured.

An illustration of the development of a relatively broad test outline is 
provided by the general plan for the 1948 form of the Premedical Science 
Achievement Test of the Association of American Medical Colleges (1).
The items were distributed as follows by objectives of instruction and sub- 
ject-matter fields: 

                                                          Percent of Items

1. Objectives of instruction
   a) Comprehension, interpretation of scientific materials        35
   b) Application of concepts, principles, etc.                    30
   c) Recall of basic concepts                                     35

2. Subject-matter field
   a) Biological sciences                                          40
   b) Chemistry                                                    40
   c) Physics                                                      20


Combined in a two-way table, these specifications led to the breakdown 
shown in Table 5. 


Table 5

Number and Percent* of Items in Each Category
of the 1948 Premedical Science Achievement Test†

[The body of the table is not legible in this copy. Its rows were the three
subject-matter fields (biological sciences, chemistry, physics) and its
columns the three objectives of instruction (recall of basic concepts;
application of concepts, principles, etc.; comprehension, interpretation of
scientific materials), with the number and percent of items in each cell.‡]

* Rounded to the nearest whole number.
† Reproduced from an unpublished report to the Association of American Medical Colleges by the Graduate
Record Office, December 1947.
‡ Items classified in more than one category are assigned fractional weights in these categories.
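In code, the way the two marginal distributions can be combined into a two-way table of item counts might be sketched as follows. The assumption that the two margins combine independently is introduced here for illustration only; the actual table also assigned fractional weights to items falling in more than one category.

```python
# Sketch: combining the two marginal percentage distributions into a
# two-way table of item counts, ASSUMING (for illustration only) that
# the margins are independent.

objectives = {"recall": 35, "application": 30, "comprehension": 35}   # percent
fields = {"biological sciences": 40, "chemistry": 40, "physics": 20}  # percent
total_items = 100

# Each cell gets the product of its two marginal proportions.
table = {
    (field, obj): field_pct * obj_pct * total_items / 10000
    for field, field_pct in fields.items()
    for obj, obj_pct in objectives.items()
}

for (field, obj), n in sorted(table.items()):
    print(f"{field:20s} {obj:14s} {n:5.1f} items")
```

Under this independence assumption, for example, physics items testing application of principles would number 20% × 30% of 100, or 6 items, and the nine cells sum back to the full 100 items.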

It should be made clear that in practice a much more detailed outline 
of the contents within each cell of a table such as Table 5 is needed before 
test construction proceeds. If such an outline is not prepared by persons 
thoroughly acquainted with the subject-matter fields and with the purposes 
of the test, too great a share of the test planning is likely to be done by 
the item writers. Even if the item writers are equally competent in the 


subject-matter fields, detailed breakdown of the content to be covered at 
the stage of test planning almost always is worth while in insuring 
adequate coverage and in avoiding uneconomical construction of too many 
or too few items on a particular topic or requiring a particular type of behavior.

In practice it may not be feasible to attempt to adhere very rigorously 
to the weights indicated in the cells of a table such as that presented in 
Table 5. Frequently the same item may be classed in two or more cells, 
or ideas for good items may be more plentiful for some cells than for 
others, or more items may survive the tryout in some categories than others. 
Furthermore, as will be made clearer later, the weight actually given to a 
category is by no means a function only of the number of items assigned 
to it, and the number of items indicated in the table is only a very rough 
estimate of the number really needed to secure the desired weight. Ac- 
cordingly, the weights indicated in the original table may be altered some- 
what during the course of the test construction, as sound reasons for 
doing so are encountered. Tests intended to measure achievement in 
particular courses of instruction must, to some extent at least, be based 
upon what the pupils were actually taught, rather than upon what some- 
one may think should be taught. However, the limitations that this pro- 
cedure imposes on the test so far as its uses in evaluation are concerned 
must be clearly recognized (see pages 123-34). 

A second illustration, confined entirely to the analysis of behavior
objectives, is provided by the following list prepared in the course of 
planning the college chemistry test of the United States Armed Forces 
Institute. There is no standard introductory course in chemistry common
to most colleges and universities, but there is fairly general agreement
on certain topics that should be taught in more or less the same sequence. 
Furthermore, the objectives of instruction in chemistry are reasonably 
well agreed upon. The outline for the USAFI examination in college 
chemistry, therefore, was designed to link the basic objectives of instruc- 
tion to the subject matter generally taught in most colleges and uni- 
versities (2). 

Behavior Objectives 

I. Ability to recall important information 

A. Knowledge of important facts 

B. Knowledge of definitions of important terms 

C. Acquaintance with important concepts 

D. Verbal understanding of theories and principles 

E. General knowledge of the physical properties and chemical behavior 
of the more important elements and their compounds 


11. Ability to apply principles in making simple predictions 

A. Functional understanding of the principles and theories of chem- 
istry and their interrelationships 

B. Application of a definition 

C. Application of a principle in situations similar to those encountered 
in a typical course 

D. Application of principles in new situations taken from everyday life

E. Interpretation of a set of data and drawing conclusions from them 

III. Ability to apply principles quantitatively by carrying out calculations 

A. Quantitative significance of chemical symbolism

B. Balancing of chemical equations 

IV. Ability to use the scientific method 

A. Distinction between observed phenomena and their theoretical explanations

B. Explanation of phenomena in terms of theory 

C. Presentation of the experimental evidence for a theory 

D. Identification of the assumptions underlying a given conclusion 

E. Identification of the factors that must be controlled in an experiment 

F. Identification of statements that are true merely by definition 

Such an analysis as the foregoing of behavior objectives should prove 
very useful in the preliminary stages of planning a test. Before actual con- 
struction of items begins, however, it should be accompanied by an equally 
comprehensive analysis of the content to be sampled. 

A third example will be drawn from an operational rather than a con- 
tent field. As a preliminary step in constructing a reading test, Frederick 
B. Davis (3) surveyed the literature to identify the comprehension skills 
deemed by authorities to be most essential in reading. A list of several 
hundred specific skills was made, and these were then classified in an at- 
tempt to group together skills that are closely interrelated and that have 
low correlations with skills in other groups. From this approach there 
emerged the nine groups of skills shown below. 

Davis' Classification of Reading Skills 

1 . Knowledge of word meanings 

2. Ability to select appropriate meaning for a word or phrase in the 
light of its particular contextual setting 

3. Ability to follow the organization of a passage and to identify antecedents 
and references in it 

4. Ability to select the main thought of a passage 

5. Ability to answer questions that are specifically answered in a passage 

6. Ability to answer questions that are answered in a passage but not in the 
words in which the question is asked 

7. Ability to draw inferences from a passage about its contents 

8. Ability to recognize the literary devices used in a passage and to deter- 
mine its tone and mood 


9. Ability to determine a writer's purpose, intent, and point of view, i.e., to 
draw inferences about a writer 

Once the classification of the numerous specific skills into the nine 
broader categories had been made, test items were constructed for each 
of the categories. Davis recognized that the first skill, knowledge of word 
meanings, is basic to measurement of others, and that some overlapping 
among the others is inevitable. In assembling the items into a test, he 
attempted to include the proportion of items testing each of the last eight 
skills that he judged to conform to the judgments of authorities on read- 
ing. Whether or not the skills isolated by this type of analysis are in fact 
independent and separately measurable is a question for statistical analysis. 

It should be noted that the experience gained in the writing of the test 
items often contributes to a further clarification of the test objectives, and 
that frequently it becomes desirable to alter the test outline as the item-writ- 
ing proceeds. There are few activities, in fact, that contribute more effectively 
to the clarification of objectives than the task of translating them into 
specific test behavior situations and the process of modifying and select- 
ing items on the basis of the data secured from the experimental tryout. 
In general, therefore, it is undesirable to view the original test outline as 
"frozen." Rather, the test author should deliberately strive to improve the 
table of specifications as the test is being built. Indeed, the best time to 
write the final table of specifications for the test is after the items have 
been written and tried out and the final test assembled. 

Analysis of curriculums and textbooks

As indicated before, the outline of a test intended to measure general 
achievement in a particular course of instruction should specify the test 
content in considerable detail. Moreover, each element of the content 
should be weighted roughly in proportion to its judged importance. When 
a teacher is constructing a test for use only in her own classes, perhaps 
she can be the sole judge of the appropriate weight for each subject- 
matter topic. Her judgment is quite likely to be influenced, whether con- 
sciously or unconsciously, by the relative emphases placed upon various 
topics in the textbooks she uses or consults. When a test is expected to 
have wide usage, perhaps nation-wide, it becomes essential that the judg- 
ments of a number of experts in the subject-matter field be solicited. Con- 
sequently, it has become standard procedure in the planning of a subject- 
matter test, especially one for a broad market, to analyze ten or a dozen 
of the more widely used textbooks in the field in order to secure a tenta- 
tive list of topics to be tested and some indication of the appropriate 
emphasis to be given to each of them. The median number of pages de- 


voted to a topic is often taken as a rough index of its importance, as 
judged by textbook writers and editors. This serves also as a crude index 
of the amount of time given to each topic by classroom teachers. 
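The page-count procedure can be sketched in a few lines of code; the topics and page counts below are invented purely for illustration.

```python
# Hypothetical sketch of the textbook-analysis procedure: the median number
# of pages devoted to each topic across six textbooks serves as a rough
# importance index, rescaled here to percentage weights.
from statistics import median

# Pages devoted to each topic in six widely used textbooks (invented data).
pages_per_topic = {
    "fractions": [12, 18, 15, 20, 11, 16],
    "decimals":  [8, 10, 9, 14, 7, 12],
    "percent":   [6, 5, 9, 4, 8, 7],
}

index = {topic: median(counts) for topic, counts in pages_per_topic.items()}
total = sum(index.values())
weights = {topic: round(100 * m / total, 1) for topic, m in index.items()}
```

With these invented figures the median for "fractions" is 15.5 pages, which rescales to a weight of about 49 percent of the test.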

Analysis of curriculums from representative cities and counties scattered 
throughout the United States provides another method of determining the 
behavior objectives and subject-matter content to be measured by objective 
tests. Data obtained from curriculums provide a somewhat more accurate
indication of the amount of time actually devoted in the classroom to each 
topic in a given field than do those obtained by analyzing textbooks. 

Pooled judgments of authorities 

In the construction of educational achievement tests, the advice and 
assistance of authorities should almost always be sought with respect to 
the topics of the test outline and the emphasis on each. The panel of 
expert consultants should not only command wide respect but should
also constitute an adequate representation of professional thought in the 
field. For example, in the construction of a set of reading tests, it would 
be unwise to consult only with men known to be primarily concerned 
with the mechanics of eye movements and their training. A test that would 
satisfy this group of experts might be quite unsatisfactory to others in the field.

Ordinarily, subject-matter experts serving as consultants cannot be ex- 
pected to write test outlines from "scratch." The most profitable way to 
use consultants is usually to submit to them a tentative outline, perhaps one 
based on analyses of the objectives of instruction as stated in representative 
curriculums and of subject-matter content as indicated by up-to-date and 
widely used textbooks. This suggestion is made on the assumption that 
the person initiating the test is either himself able to formulate such an 
outline or has an assistant with the necessary familiarity with the subject 
matter. The consultants are asked to criticize or revise the topics in the 
outline, to suggest additional topics, and to weight each of the topics 
finally agreed upon. Sometimes the weights derived from curriculum and 
textbook analyses and the test constructor's subjective judgment (if he 
has sound basis for such judgment) are presented on the outline given to 
the experts for criticism, but the disadvantage is that the experts tend to 
accept such weights and to refrain from suggesting changes in them. In 
general, therefore, a test outline without any indication of the emphasis to 
be given the topics is preferable for this purpose. A request to assign 
weights to the topics is then most likely to stimulate active and thoughtful 
consideration of the problem. 


Determination of Appropriate Weights

An illustration of the use of expert judgment in constructing a test
outline is provided by a paper-and-pencil mechanical comprehension test. 
Careful study of handbooks and manuals for mechanics suggested that 
eight topics for which paper-and-pencil test items could be constructed 
covered the field rather thoroughly. A list of these eight topics was, there- 
fore, prepared and sent to sixteen individuals known to have a high de- 
gree of competence in mechanical comprehension or its measurement. In- 
structions accompanying the list were, in part, as follows: 

On the accompanying sheet are brief descriptions of each of eight topics. 
Please read the descriptions and then estimate what per cent of the total test 
should be devoted to each topic. Write your estimate in the parentheses 
given. . . . Your eight estimates should add up to 100 per cent. If you 
think some element of mechanical comprehension other than the eight listed 
is important in mechanical comprehension as a whole, please write a brief 
description on the line below the description of Topic 8. 

In such instructions, provision might also be made for the judges to 
add topics that they believe are not covered by those given and to indicate 
appropriate percentage weights for them, again in such a way that all 
weights assigned sum to 100 percent. 

The median of the percents assigned by the consultants to each topic was used
as an indication of the contribution that each topic should make to the 
total variance of a test of mechanical comprehension. The eight topics and 
their weights were as follows: 

Topic Percent 

1. Technical vocabulary 15 

2. Tools 10 

3. Hydraulics 7 

4. Properties of materials and structures 9 

5. Uses of mechanical devices 15 

6. Mechanics 15 

7. Electricity 8 

8. Mechanical movements 21 

Total 100 
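The pooling step just described can be sketched as follows; the three consultants' estimates shown are invented, not the figures actually collected from the sixteen judges.

```python
# Sketch of the pooling procedure: each consultant distributes 100
# percentage points over the eight topics, and the median estimate for
# each topic is taken as its weight (consultant figures are invented).
from statistics import median

# rows = consultants; columns = the eight topics listed above
estimates = [
    [15, 10, 7,  9, 15, 15, 8, 21],
    [12, 12, 8, 10, 14, 16, 7, 21],
    [16,  9, 6,  8, 16, 14, 9, 22],
]

# Median across consultants for each topic.
medians = [median(topic_col) for topic_col in zip(*estimates)]

# The medians need not sum to exactly 100, so rescale them.
total = sum(medians)
weights = [round(100 * m / total) for m in medians]
```

Rescaling matters in general because, unlike each consultant's own figures, the topic-by-topic medians are not constrained to total 100 percent.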

Note that the consultants were not asked to indicate the number of 
items of each type to include in a 100-item test of mechanical comprehen- 
sion. Instead, they were asked to answer a question that is more easily 
answered by laymen; namely, "How much weight should each type of 
item have in determining the total score?" Again, it should be noted that 
for inexperienced test constructors, especially if they are not authorities 
in the subject-matter field in question, a more detailed outline of content 
would be desirable before actual item writing proceeds.


Let us next consider how to translate the data into specifications regard- 
ing the number of items of each type needed to produce the desired weights.

It is sometimes assumed that the contribution of a part of a test to the 
total score is directly proportional to the number of items assigned to that 
part. One might casually suppose that in a test of thirty items it would 
necessarily follow that a twenty-item part would contribute twice as much 
to the total score as a ten-item part. This supposition is not necessarily 
correct, however, as can easily be shown. Let us first consider the rela- 
tionship between the number of items in a test and its standard deviation. 
The variance of a test, expressed in terms of the variances and inter- 
correlations of its component items, can be written as follows: 

\sigma_T^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_n^2 + 2(\sigma_1\sigma_2 r_{12} + \sigma_1\sigma_3 r_{13} + \cdots + \sigma_{n-1}\sigma_n r_{n-1,n}).  (1)

If the intercorrelations of the items were all unity and their variances 
all alike (that is, if the items were all equally difficult), equation (1) 
indicates that the standard deviation of scores on the test would vary 
directly in proportion to the number of items included in it. If the inter- 
correlations of the items were all zero and their variances all alike, equa- 
tion (1) indicates that the standard deviation of the scores would vary 
directly in proportion to the square root of the number of items included 
in it. Both sets of assumptions are untenable in practice. In actual practice, 
the item intercorrelations are likely to be closer to zero than to unity; 
hence, it may be serviceable as a first approximation for the test constructor 
to figure that the standard deviation of a part of a test will increase a little 
more than in direct proportion to the square root of the number of items 
in it; that is, doubling the number of items in a part will make the stand- 
ard deviation of a part about one and one-half times as large. 

This is, however, a very rough approximation, since so much depends 
on the difficulty of the items added. For example, if the items added are 
all either very easy or very difficult, doubling the number of items might 
have only a slight effect on the standard deviation of the score distribution. 

The second relationship we should consider is that of the contribution 
of a part of a test to the variance of the entire test. This is given precisely 
by a modified form of equation (1), as follows:

σ_t² = σ_1(σ_1 + σ_2r_12 + ··· + σ_n r_1n)
     + σ_2(σ_1r_12 + σ_2 + ··· + σ_n r_2n)
     + ···
     + σ_n(σ_1r_1n + σ_2r_2n + ··· + σ_n).  (2)

Each line on the right-hand side of equation (2) constitutes the con- 
tribution of one part to the variance of the entire test. Inspection shows 
that this contribution depends on the size of the standard deviation of the 
part and on its correlations with all other parts. A test constructor does 
not have available the necessary data to solve this equation at the planning 
stage. It is common practice, however, to make a very crude estimate of 
the probable contribution of each part of a test to the total variance by 
assuming that the intercorrelations of the parts are all zero. On this assump- 
tion, the contribution of each part becomes directly proportional to the 
square of its standard deviation. Since the square root of the number of 
items in a part is very roughly proportional to the standard deviation of 
the part and since the latter may be taken as a crude approximation of the 
square root of the contribution of the part of the entire variance, it follows 
that the test constructor, by an extremely rough approximation, may make 
the number of items in a part proportional to the desired contribution of 
that part to the entire variance. Thus, for the test of mechanical com- 
prehension for which percentage weights determined by pooled judg- 
ments of experts were presented earlier (page 167), the number of items 
on each topic to be included in a 100-item test would correspond very 
roughly to the percentage weights. It is important to emphasize the crudity 
of this approximation. If the test author has reason to believe that certain 
parts correlate highly with one or more of the others, consideration of the 
relationships in the equation above might lead him to depress the weights 
of such tests somewhat. Tryout of the test may reveal unexpected trends 
in the standard deviations of the parts and thus call for readjustment of 
the weights. It is only through the use in equation (1) of empirical data
secured from a representative tryout that one can determine with high 
dependability the true weights of the various parts of the test or of the 
various types of items. 
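The contributions given by equation (2) can be computed directly once part standard deviations and intercorrelations are known from a tryout. A minimal sketch (the part SDs and correlations below are invented illustrative values), contrasting the exact contributions with the crude zero-correlation approximation:

```python
import numpy as np

# Hypothetical part SDs and intercorrelations (illustrative values only).
sds = np.array([4.0, 3.0, 2.0])
r = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.3],
              [0.2, 0.3, 1.0]])

cov = np.outer(sds, sds) * r   # covariance matrix of the parts
contrib = cov.sum(axis=1)      # each line of equation (2): sigma_i * sum_j sigma_j * r_ij
total_var = contrib.sum()      # equals the variance of the entire test

print(contrib / total_var)     # effective proportional weight of each part

# Under the crude zero-correlation assumption the weights would instead be
# proportional to the squared SDs:
print(sds**2 / (sds**2).sum())
```

Positive intercorrelations inflate the effective weight of a part beyond what its own variance suggests, which is why the text recommends depressing the weights of parts known to correlate highly with others.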

It should now be more clearly apparent why the statement was earlier 
made that it is not necessary or desirable to attempt to adhere rigorously 
to the weights or distribution of items in the various categories in the 
original table of specifications. In the first place, the weights themselves 
represent a series of compromises among the authorities consulted, and 
seldom correspond closely to the weights suggested or preferred by any 
individual authority. The weights are thus merely estimates of the "best" 
set of weights, and can seldom be regarded as highly reliable estimates. 
In the second place, the item writers may prove incapable of building 
convincing items for some categories in the numbers specified, but may 
produce a surplus of very good items in other categories. This sometimes 
may suggest that the former categories were originally none too well con-
ceived, or that they are concerned with unmeasurable or unteachable out-
comes or traits, in which case the original weights might be somewhat 
reduced for the former and increased for the latter categories. Again, the 
result of the tryout may indicate that the authorities were somewhat 
unrealistic concerning actual achievements of the pupils to be tested, or 
may suggest an earlier or later grade placement of some of the outcomes 
tested. Finally, it must be recognized that the number of items assigned 
to the categories is only a very rough indication at best of the weights 
actually given to these categories in the total test score. 

The foregoing does not imply that one may be careless about the dis- 
tribution of items, or that careful planning is not worth while. On the 
contrary, it is highly desirable to prepare a table of specifications with the 
utmost care, and departures from the table and from its weights should 
be allowed only if good reasons for doing so present themselves. Such 
reasons, however, will often be encountered, and it would be unwise to 
disregard them or to assume that nothing may be learned from the tryout 
of the items that will lead to improvements in the table of specifications. 

Determining Test Length 

The number of items that should be included in the final form of a 
test is, from an idealistic standpoint, determined by the purpose of the 
test or the uses to which it is to be put, and by the statistical characteristics 
of the items. If important decisions regarding individuals are to be based 
on the test, it must be more reliable and hence must contain more items 
than would be the case if it were to be used only for crude group compari- 
sons. Again, if the items of the test are very homogeneous in nature, the 
optimal number is lower than for highly heterogeneous item content or 
form. A test for which several part scores are to be obtained will require 
more items in order to produce a given degree of reliability of each part 
score than will a test on which only a single over-all score of the same 
degree of reliability is needed. These are all important considerations in 
setting test length. 
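The relation between reliability and test length invoked here is commonly estimated with the Spearman-Brown prophecy formula. A minimal sketch (the formula is standard, but the reliability figures below are assumed for illustration and are not taken from this chapter):

```python
def spearman_brown(r_current, k):
    """Projected reliability when a test is lengthened k-fold with
    comparable items (Spearman-Brown prophecy formula)."""
    return k * r_current / (1 + (k - 1) * r_current)

def length_factor(r_current, r_target):
    """How many times longer the test must be to reach r_target,
    obtained by solving the prophecy formula for k."""
    return (r_target * (1 - r_current)) / (r_current * (1 - r_target))

# A 40-item test with reliability .75, required to reach .90 before
# important decisions about individuals are based on it:
k = length_factor(0.75, 0.90)
print(k)                        # 3.0 -- about 120 comparable items
print(spearman_brown(0.75, k))  # check: 0.90
```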

Another factor influencing test length is the amount of time likely to be 
available or convenient for the administration of the test. Often, and 
particularly for educational testing, there is likely to be a maximum length 
of testing period related to administrative factors such as the usual length 
of a class period. Disregard of the exigencies of the practical situation may 
mean that an otherwise superior test will remain unused. 

Given the allotted time, the test constructor, on the basis of experience 
with the type of material contained in the test, estimates the number of 
items likely to be appropriate. The number of items feasible for a particu-
lar period of time varies both with the form and content of the items and
with the type of behavior they attempt to elicit. More specifically, the 
mere length of each item, insofar as it is related to reading time, will 
affect the number of items that can be used fruitfully. The vocabulary 
level of an item and the difficulty of its sentence structure bear directly on 
its reading time. Some mathematical items require much more time than 
others for computation. Items that tap the higher thought processes take more
time than those dependent on rote memory. 

On the basis of the foregoing factors, together with the proportion of 
the total test that should be devoted to each type of item (as established 
in the test outline), the test constructor decides how many of each type of 
item the test should contain. Preferably, and especially if the test is to 
have widespread use, these numbers should be regarded as tentative until 
the test has been tried out. After the initial tryout, available data on 
reliability or validity can be used in modifying the original numbers of items 
of each type or the length of the total test. Ways to determine experimen- 
tally the optimum length of a test are described in chapter 10 (pages 

Determining the Number of Items To Be Constructed for Tryout

The number of items which should be constructed for tryout is always 
considerably larger than the number needed for the finished test. Review 
of test items, subsequent to tryout, by technical and subject-matter con- 
sultants on the basis of the statistical analysis of the tryout data commonly 
brings to light a number of items with such serious shortcomings that they 
cannot be salvaged. The planning of the number of items to be con- 
structed should take cognizance of the anticipated item mortality. The 
loss will vary, of course, with the experience and competence of the item 
writer. It will also depend upon the type of item. Fewer surplus items 
are commonly needed, for example, on arithmetic fundamentals and on 
grammatical usage than on preferred diction. With some types of items, 
two or three times as many items as will eventually be needed should be 
constructed; in other cases the mortality may be only 10 to 20 percent. 
These problems are considered in detail in the chapters on tryout of the 
test, on analyses of the tryout data, and revision after tryout (chapters 
8 and 9). 
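The allowance for item mortality described above is a simple computation. A minimal sketch (the mortality rates below are assumed for illustration, roughly following the text's examples; actual rates must come from experience with the item type and the writers involved):

```python
import math

def items_to_write(items_needed, expected_mortality):
    """Items to construct for tryout, given the fraction expected to be
    discarded after review and analysis of the tryout data."""
    return math.ceil(items_needed / (1 - expected_mortality))

# Illustrative mortality rates by item type (assumed, not empirical):
plan = {
    "arithmetic fundamentals": (30, 0.15),   # low mortality
    "grammatical usage":       (30, 0.20),
    "preferred diction":       (40, 0.50),   # write about twice as many
}
for topic, (needed, mortality) in plan.items():
    print(topic, items_to_write(needed, mortality))
```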

The number of items appropriate for inclusion in a test also depends 
on the extent to which it is intended to place a premium on speed of per- 
formance. A common tendency among test constructors is that of introduc- 
ing a speed factor into tests originally planned as measures of level of
performance. Most tests of comprehension in reading, for example, yield
scores that are a mixture of speed and level of comprehension. This matter 
should be given careful consideration when a test is planned, and the 
number of items included in the final form of the test should be dependent 
on a conscious decision to introduce speed of response or to eliminate it 
for practical purposes. (See chapter 9, pages 312-15.) The proposal made 
many years ago by Flanagan to obtain "power" scores by means of the 
repeating-scale technique should be given careful consideration when a 
test is planned (4). This technique has been used much less frequently 
than occasion for its use has arisen. 

Determining the Time Limits for the Test 
The problem of determining the amount of time in which the test is to 
be administered is ordinarily inseparable from that of determining the 
length in terms of number of items. Sometimes, as has already been sug- 
gested, the test is planned to fit a predetermined time interval, such as a 
45-minute class period, in which case the problem is to estimate how many
items will yield the maximum validity and reliability for that time interval. 
In other cases, the test is planned to yield a given minimum validity or 
reliability, in which case the problem is to estimate simultaneously both how 
many items and how much time will be needed to meet the standards set. 
This problem is considered fully in chapter 10, pages 336-38. 

Planning the Types of Items

Before test construction is under way, and ordinarily at a fairly early 
stage in planning a test, decisions must be made concerning the types of 
items to be included in a test. 

Relation to Scoring Facilities 
A major factor affecting such decisions is the degree of scoring objectiv- 
ity considered necessary or desirable. Tests intended for extensive usage 
are almost without exception made relatively objective; and among the 
various forms of so-called objective tests the more objective, such as the 
"multiple-choice," are to be preferred to the less objective, such as the 
"completion." This is true whether hand-scoring or machine-scoring is con- 
templated. If the tests are to be scored by machine, however, then selec- 
tion of the highly objective types of items is clearly mandatory. 

Relation to Administrative Facilities 
In addition to the scoring facilities, the probable quality of the test 
administration has a bearing on the types of items to be selected. If the 
test is likely to be given by different persons, and by persons with little
training or experience in test administration, then it behooves the test
planner to take precautionary measures to minimize the difficulty of ad- 
ministering the test. Inclusion in the test of items of only one type greatly 
simplifies the administration of the test, especially when a single over-all 
time limit applies. Use of a single item form reduces the amount of time 
necessary for giving directions and can help to eliminate the possibility 
that failure to understand directions may invalidate test scores. 

Use of Few Versus Many Types of Items 

Some have argued that use of several different kinds of items lends 
interest to a test through its variety. However, the interest value of a test
depends primarily upon the quality of the items, rather than upon their 
external form, and if the items are competently constructed, even though 
all of one form, there will usually be no problem of maintaining interest. 

The form of items used must be suitable in relation to test content. 
Some test constructors believe (and with some justification) that certain 
types of content or particular topics lend themselves more readily to par- 
ticular item types than to others. It may even be felt that there should be 
sufficient flexibility in the selection of item forms that the particular item 
topic is allowed to determine the form, which hence should not be speci- 
fied in advance. Probably most specialists in the field of test construction 
would regard this as an undesirable extreme of a position that in moderation 
is reasonably acceptable. 

On the whole, arguments for use of as few forms as possible, and 
preferably only one, for a given test or area of subject matter probably 
outweigh arguments for variety. 

Achieving Appropriate Item Difficulty 

Relation to Test Purpose 

Test planners should give careful attention to the desired level and 
range of difficulty of test items and discuss this topic with the persons 
who are to write the items. Technical considerations pertinent to what con- 
stitutes the optimal level and range of difficulty in relation to the purpose 
of the test are treated in chapter 9. Once it is decided what level and dis- 
tribution are to be sought, the task is to attempt to construct items that will 
yield the desired characteristics. 

Factors Affecting Item Difficulty 

As is made clear in chapter 9, an index of item difficulty may be obtained 
by estimating the percentage of examinees who know the answer to it.
Primary factors affecting the difficulty of an item are the nature of its con-
tent and the type of behavior it requires of the examinee. A test construc- 
tor may quite readily judge that an achievement test item on a topic to 
which eighth-graders have not been introduced will be very difficult for 
them or that an item requiring a new application of a scientific principle 
will produce more failures than one based on simple recall of the principle. 
Such judgments are likely to be helpful. 

Apart from intrinsic content, however, other factors influence item diffi- 
culty, often in very subtle ways. Unusual vocabulary may markedly, albeit 
unintentionally, influence responses to an item. Awkward sentence structure 
and undue formality in the style of the language used often have unpre- 
dicted effects. A shift from use of third person to first or second may make 
an item significantly easier. Even such apparently extraneous factors as the 
form of the item and the directions to the examinees may affect item diffi- 
culty. Matters such as these should be brought to the attention of the item 
writers, in addition to the more obvious considerations related to the in- 
trinsic nature of the subject matter and type of behavior involved. 

Using Judgments and Statistical Indices of Difficulty 

The use of subjective judgment in estimating item difficulty at the stage 
of item construction is to be encouraged. Such judgments, when based on 
all available experience, are distinctly helpful in leading to the construc- 
tion of items of the desired difficulty. They are not ordinarily sufficiently 
accurate, however, that maximum efficiency of the finished test can be 
secured on this basis alone. Except in rare situations, the items must be 
tried out on a sample representative of the universe for which the final test 
is intended. As explained in chapters 8 and 9, selection of items for the 
final form of a test should systematically take into account the difficulty 
level of the items. In order to secure anything like maximum efficiency of 
measurement, it is necessary to adjust the distribution of difficulty of the 
items in a test so that it will be appropriate to the subjects being examined 
and to the purpose or purposes of the examiner. This final adjustment is 
accomplished by using difficulty indices computed on the basis of data ob- 
tained during the tryout of the test. 
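Computing the difficulty indices from tryout data is straightforward: the index for each item is simply the proportion of the tryout sample passing it. A minimal sketch (the response data are randomly generated for illustration, and the retention band is an assumed example, not a recommendation from this chapter):

```python
import numpy as np

# Hypothetical tryout data: rows = examinees, columns = items,
# entries 1 (right) / 0 (wrong), generated at random for illustration.
rng = np.random.default_rng(0)
responses = rng.binomial(1, p=np.linspace(0.15, 0.95, 12), size=(200, 12))

p_values = responses.mean(axis=0)   # proportion passing each item

# Flag items falling outside the band of difficulty sought for the
# final form (band chosen here purely for illustration):
low, high = 0.30, 0.80
keep = (p_values >= low) & (p_values <= high)
print(np.round(p_values, 2))
print("retain:", np.flatnonzero(keep))
```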

Sometimes the tryout data reveal that not enough items of a given level 
of difficulty were constructed. The items may prove, in general, to be too 
difficult or too easy for the subjects in the experimental group. The data 
may show that an efficient measuring instrument cannot be constructed by 
using only the items that were tried out. Careful attention to the most 
appropriate range of difficulty while the items are being written and before 
the tryout form is assembled helps to preclude this possibility. All items
should be keyed and criticized by subject-matter experts and test techni-
cians. Items that are apparently outside the range of difficulty that will be 
required should be discarded. If there seem to be too many items at one 
level of difficulty and not enough at another level, that condition should be 
corrected before the tryout form is prepared for administration. Careful 
planning and editing of the experimental form of a test can save a great 
deal of money and trouble and permit construction of tests that approach 
maximum efficiency. Unfortunately, the painstaking care and psychological 
insight needed to produce tests of high quality and high efficiency are often 
lacking in the process of test construction, perhaps because they cannot 
be routinized as can item analysis techniques and other statistical proce-
dures.

In large test construction agencies, a test planner sometimes has avail- 
able a large reservoir of items whose difficulty is known from previous 
tryouts. In such a situation, the test outline can advantageously include the 
plan for the range and distribution of difficulty indices for each subject- 
matter area or each type of behavior included in the outline. Then the 
selection of items can be done systematically to insure not only the desired 
content but also the desired difficulty characteristics. 

Planning Other Test Development Operations 

In the introduction to this chapter it was stated that planning is essential 
throughout a test construction project. Once one has defined the purpose 
of a test, prepared an outline of its content, selected the type or types of 
items to use, and taken precautionary measures to insure appropriate item 
and test difficulty, he can still avoid countless mishaps by anticipating 
them. This section, covering practically all of the remaining steps in a 
typical test development project, points to additional problems to which 
the test planner should give early attention. 

Construction of Items 

The question of who is to construct the items needed for a particular 
test warrants attention. In some instances, where the subject matter con- 
cerned is fairly simple, the most feasible plan is to have the items con- 
structed by a specialist in test construction who either knows or can readily 
acquire the requisite subject-matter knowledge. If test technicians unfamil- 
iar with a more complex field of knowledge or skill attempt to construct 
items in that field, however, the test as a whole is very likely to be faulty. 
Some of the individual items may not have a defensible answer or may have 
more than one defensible answer. Perhaps more serious, the items are 
likely to depend to too great an extent on material readily available in
books, and too little on interpretations that would evidence real under-
standing. In other words, the resulting tests are likely to be open to the 
charge of being "textbookish." As the subject-matter field becomes more 
complex and specialized, then, it becomes more and more important that 
the item constructor have had extensive training and experience in the
field concerned.

In some situations, there may be available for item writing persons who 
not only know the content area thoroughly but who also have had training 
and experience in the specialized techniques of item writing. This is 
probably the exceptional case, however. It is ordinarily necessary to pro- 
vide the subject-matter specialist with specific training and supervision in 
item writing. The necessity for providing such training again calls for 
decisions on such matters as how much training is essential, the proper 
timing of various aspects of the training, the extent to which it can be 
conducted in group sessions, the amount of time that can profitably be 
devoted to discussion and joint revision of individual items, and so on. 

The rate of item construction is another matter on which planning is 
required. No single rate of construction applies to all types of items or 
to all item constructors. Some item forms require more time than others. 
True-false items can be written at a more rapid rate than multiple-choice 
items, for example. There are also wide differences attributable to the sub- 
ject-matter area: vocabulary or arithmetic items, for example, may be con- 
structed by an experienced item writer at the rate of, say, fifty or more 
a day, while a rate of from three to ten a day may be entirely acceptable for 
the production of items for a test of ability to interpret reading materials 
at an advanced level. For a given item form and area of subject matter, 
furthermore, some item writers are capable of producing items several times 
as rapidly as others. 

The deadline for completion of the test must be taken into account 
together with the expected rate of item production in deciding how many 
item constructors are needed. Sometimes the deadline is such that a number 
of item constructors must be used, even though this adds to the problem 
of training and creates need for coordination of their work. 
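The staffing arithmetic implied here can be set out explicitly. A minimal sketch (all figures are assumed for illustration; the text stresses that real production rates differ severalfold between writers, so such estimates are only planning approximations):

```python
import math

def writers_needed(total_items, items_per_writer_per_day, working_days):
    """Minimum number of item writers, assuming equal rates -- a crude
    planning approximation, since actual rates vary widely."""
    per_writer = items_per_writer_per_day * working_days
    return math.ceil(total_items / per_writer)

# 600 reading-interpretation items at ~5 items per day, with 20 working
# days before the tryout deadline (illustrative figures):
print(writers_needed(600, 5, 20))   # 6
```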

Apart from the urgency of test deadlines, there is often a further reason 
for having several persons construct items for a test. Even with the use of a 
carefully detailed outline of test content, a set of test items constructed by 
one person is likely to be influenced by whatever biases he may have in 
the content area; and such biases may escape the later reviewers or be diffi- 
cult to correct. Use of several item constructors with different backgrounds 
of training and experience leads to the reflection in the test of somewhat 
different points of view or emphases within the subject-matter area. 


There is also need for planning in order to be sure that any source 
materials needed can be available at the time they are needed. This may 
seem an obvious point, but it is one that is too frequently overlooked. 

Review of Items 

As a general rule, test items should be reviewed, before tryout on any 
sizable number of subjects, from three points of view: the accuracy and ap- 
propriateness of their subject-matter content, their technical merits apart 
from content, and their "editorial" quality. Effort spent in intensive re- 
view of these three types will obviate the need to try out unusable items 
and will improve the general quality of the items tried out. 

Sometimes one person has skill in all three areas and thus is able to 
combine the three types of review. More frequently, different persons are 
used for the three quite different purposes. No hard and fast rules can be 
set down as to how many reviewers for each function should optimally 
be used. The numbers depend to a considerable extent on the type of items 
and the particular subject-matter area and also on the persons involved. 
Individuals who earn their livelihoods as test constructors, for example, 
should develop skill in handling such editorial matters as punctuation, 
spelling, diction, uniformity of style, and so on, so that a review by only 
one additional person from an editorial standpoint should suffice. Perhaps 
two test technicians might profitably review the items from the technical 
point of view. Ordinarily a larger number of persons should review the 
subject-matter content intensively — somewhere between, say, three and ten,
depending on the complexity of the area. Often it is desirable to bring 
the subject-matter reviewers together to discuss points of difference and 
attempt to achieve a final version of each item that meets all objections. 

Records on Each Item 

Before item construction is begun, the test planner must decide what 
records are to be kept on each item. If the first drafts of items are to be 
retained, even if only until a test is in print, having them all on sheets of 
paper or cards of a single size will facilitate handling and filing. If re- 
viewers' comments are to be retained, provision for the comments to be 
recorded in a uniform way will be helpful. If a record of any written source 
used in construction of the item is desired, a plan for this record should be 
made before the item reaches its first draft; otherwise the book may have 
been called back to the library. If a record of who constructed each item is 
desired, in case different persons are working on a test, his identity should 
be recorded on the initial draft and later transcribed to whatever final 
record form is maintained for the item. Plans also need to be made for
keeping a record of item use and whatever statistical data are desired for
each item. More detailed treatment of item record systems is given in 
chapter 9. 
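The records described above lend themselves to a uniform structure. One possible layout is sketched below; the field names are ours, invented to mirror the records the text mentions (author, written source, reviewers' comments, item use, and statistical data), not a scheme prescribed by the chapter:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ItemRecord:
    item_id: str
    author: str                  # who constructed the item
    source: str = ""             # written source consulted, if any
    first_draft: str = ""        # retained at least until the test is in print
    reviewer_comments: List[str] = field(default_factory=list)
    uses: List[str] = field(default_factory=list)   # forms in which the item appeared
    difficulty: Optional[float] = None              # index from the latest tryout
    discrimination: Optional[float] = None

rec = ItemRecord("MC-0042", author="J. Smith", source="Jones, ch. 3")
rec.reviewer_comments.append("Distractor B implausible; revised.")
rec.difficulty = 0.62
print(rec.item_id, rec.difficulty)
```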

Tryout and Analysis of the Test Items 

Chapter 8, devoted to the tryout of tests, makes clear that many ques-
tions about tests can be answered with assurance only on the basis of try- 
outs. The topic is introduced here because only through attention to plan- 
ning for the tryout and analysis of test results can there be any assurance 
that the tryout will be adequate. The purposes of the tryout should be 
outlined in some detail, and for each the question asked whether or 
not the tryout is being planned in such a way as to serve the purpose. 

One feature of the tryout that warrants expert consideration is the selec- 
tion of the sample on which the tryout is to be conducted. This attention 
to the sample should cover not only whether the sample is reasonably repre- 
sentative of the examinee group from the standpoint of abilities but also 
whether its members are likely to have or can be made to have about the 
same degree of motivation as the examinees. 

Another feature of the tryout that should be mentioned here relates to 
timing. The test planner must be sure to allow ample time for administer- 
ing and scoring the tryout form, for analyzing the results statistically, and 
for processing whatever revisions are indicated by the tryout. This planning 
requires attention both to manpower resources and to deadlines for the 
final form of the test. 

Compilation of the Items 

A group of items is not necessarily a test. Once the individual items have 
been constructed, the problem remains of selecting from among those that 
survived the review process and the tryout those which are to constitute the 
test, and of arranging the selected items into an appropriate order. 

One problem often arising in test compilation is that of avoiding un- 
desirable overlapping among the items. There ordinarily should not be 
two items so closely related that an examinee who can answer one correctly 
automatically can answer the other (unless the test constructor may wish to 
weight the factor tested more heavily than the other items). Nor should one
item contain a clue to the correct answer to another item. Such overlapping 
of item content reduces the reliability of the test. If an outline of test 
content has been prepared in great detail, the possibility of the first type 
of overlapping should be minimized; it still may occur, however. And no 
matter how specific the test outline, the second type of overlap is probable, 
particularly if several persons working independently have constructed the
items and if all items have not been reviewed within a relatively short
period of time. 

The test compiler must give attention to the adequacy of the coverage 
of the subject-matter field by the selected items. Again, to the extent that 
the test outline is comprehensive, this problem will be reduced. But if the 
outline presents only a few broad categories, a great share of the responsi- 
bility for insuring that the test will serve its purpose rests with the com-
piler.

In the assembling of items into a test, various questions relating to 
the order of the items must be considered. Occasionally some of these 
questions are anticipated in the preparation of the test outline, a special 
form of which may be prepared to show the order in which the several 
divisions of subject matter are to be tested. 

Various ways of grouping items have been tried. The nature of the 
items and the character of administrative conditions usually are the deter- 
mining factors in deciding how to group items. If the items are homo- 
geneous in content and in difficulty, so that they are essentially interchange- 
able, the order of the items is inconsequential. 

If a test of heterogeneous content is to yield only one score, the items 
of each type can be grouped together in subtests and a separate time limit 
provided for each group of items or subtest so as to insure that each type 
of item receives its proper share of attention from the examinees. Ordinarily 
the items within each subtest would be arranged in order of ascending diffi- 
culty. The order of the subtests might be based on their relation to a logi- 
cal organization of the subject matter. Perhaps, for example, the broader 
or more general areas should come first, followed by the more specialized
ones.

If many examinees cannot finish the entire test in the time limit, and 
it is impracticable to make use of a separate time limit for each group of 
items, it may be desirable to present first the easiest items of type 1, fol- 
lowed by the easiest items of type 2, etc., until the easiest items of all 
types have been presented. There will then follow a set of moderately easy 
items of each type, ending with the most difficult item of each type. This 
is the so-called spiral omnibus arrangement. One of its disadvantages is 
that the sophisticated subject is likely to skip over the segments he thinks 
he cannot do quickly or accurately, while the more conscientious or less 
sophisticated but equally apt subject plods along, taking the items as they 
come. Naturally, the sophisticated subject is likely to get a higher score, 
which means that the validity of the test suffers because of the introduction 
of unwanted variance pertaining to personality characteristics and experi- 
ence with tests. What is worse, from one point of view, is that the sophis-
ticated subject realizes what he has done and knows that he has "got away
with it." Consequently, his opinion of objective tests is likely to be mildly 
contemptuous, to say the least. 
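The spiral omnibus ordering itself is easily constructed: cycle through the item types at each successive level of difficulty. A minimal sketch (the item labels are invented placeholders; each type's list is assumed to be pre-sorted from easiest to hardest):

```python
# Items of each type, pre-sorted easiest-to-hardest (labels invented).
types = {
    "vocabulary": ["V-easy", "V-mid", "V-hard"],
    "analogies":  ["A-easy", "A-mid", "A-hard"],
    "arithmetic": ["R-easy", "R-mid", "R-hard"],
}

# Spiral omnibus: all the easiest items of each type first, then the
# moderately easy ones, ending with the hardest item of each type.
spiral = [items[level]
          for level in range(3)
          for items in types.values()]
print(spiral)
```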

The spiral omnibus arrangement has other disadvantages, too. Among 
these are the fact that the examinees have to keep changing their mental- 
set from one type of item to another. It may be uncommon for this sort of 
performance to be required in real life situations. If, however, one's pur- 
pose is to attempt to test "mental flexibility," a spiral omnibus test might 
be useful. Needless to say, there is no way to obtain satisfactory part scores 
from a spiral omnibus test for research purposes. Part scores derived from 
this sort of test must ordinarily be regarded as expedients that are useful for 
some practical purposes. A test constructor ought to be sure that the use of 
separate time limits for different parts of a test or use of the repeating-scale 
technique is impracticable before he decides to forego the use of groups 
of more or less homogeneous items. 
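The interleaving that defines the spiral omnibus arrangement can be sketched as a simple procedure. The function name and item labels below are invented for illustration; each item type is assumed to be supplied as a list sorted from easiest to hardest.

```python
# Sketch of the spiral omnibus arrangement: cycle through the item
# types, taking the next-easiest item of each type in turn, so that
# the easiest items of all types come first and the hardest come last.

def spiral_omnibus(item_groups):
    """item_groups: one list per item type, each sorted easiest-first."""
    arranged = []
    levels = max(len(group) for group in item_groups)
    for level in range(levels):        # difficulty level: easiest first
        for group in item_groups:      # type 1, type 2, ...
            if level < len(group):
                arranged.append(group[level])
    return arranged

# Two hypothetical item types with three difficulty levels each
analogies = ["A-easy", "A-medium", "A-hard"]
arithmetic = ["N-easy", "N-medium", "N-hard"]
print(spiral_omnibus([analogies, arithmetic]))
# ['A-easy', 'N-easy', 'A-medium', 'N-medium', 'A-hard', 'N-hard']
```

The repeating-scale alternative mentioned above would instead keep each type in its own block, each block again ordered from easy to hard.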

Directions to Examinees 

It is usually desirable to prepare the directions to the examinees for the 
finished test, in at least rough form, before the items are constructed. Early 
preparation of these directions will focus attention of the item writer on 
the problem of adapting the items and item forms to the background and 
experience of the examinees, and will lead to clearer thinking about test 
length and time limits. Certainly these directions should be prepared in 
advance of the tryout, since it is very important that the same directions be 
used in the tryout as in the finished test. The preparation of these directions 
is considered at length in chapter 10, pages 351-65. 

Review of the Test as a Whole 

Once the proven test items have been assembled, the time limits estab- 
lished, and the instructions to the examinees prepared, the test should 
again be reviewed from three points of view: the technical, with particular
attention to principles of measurement, including those relating to item 
form; the subject matter, with attention to appropriateness of content and 
especially to the accuracy of the scoring key; and the editorial, this time 
with special attention to appropriate over-all format, to editorial consistency 
from one item to another, and to the proper relation between the instruc- 
tions to examinees and the test itself. 

As in the case of the review of individual items, one person may possess 
competency in more than one of these fields. Ordinarily, however, the test 
planner will have to arrange for different persons to consider the test from 
the different points of view. The number of such reviewers and the inten- 
sity of their review will depend in large part on the nature of the content 
and the excellence of the previous reviews. 

Reproduction of the Test 

The way in which a test is to be reproduced should ordinarily be given 
some consideration even before item writing begins. The method of re- 
production to be used sometimes determines what types of items may be 
employed. For example, if the test is to be lithoprinted, liberal use of photo- 
graphs, charts, tables, maps, etc., will not materially affect the cost of the 
test, and their use may be especially encouraged. If the test is to be repro- 
duced by letterpress printing or by Mimeograph, however, use of visual 
materials will have to be curtailed for reasons of cost and other practical considerations.

The method of reproduction to be used will also affect the preparation of 
the final copy. If a test is to be reproduced by a photo-offset process, for 
example, the final copy must be exactly as it is to be reproduced (except, 
of course, for size, since reduction or enlargement is possible). This means
that more expert typing is required than for a test to be set in type, for in 
the latter case interlineations or other types of corrections that the printer 
is to make can be indicated directly on the copy. If, on the other hand, a 
test is to be reproduced by Mimeograph or Ditto or a similar process, as 
might be the case for a test for use with a single class, then the final copy 
often can be made from item cards directly onto stencils. 

The test planner must give preliminary consideration to any drafting 
work that the final test will require and govern his plans according to the 
type of reproduction of the test that appears most practical in the particular 
situation. For example, if all test pages are to be multilithed, the test 
planner must decide whether all drawings are to be drawn in the size in 
which they are to appear or whether they are to be reduced before being 
fitted into place on the page or whether the entire page is to be reduced. 

Many of the details of test reproduction to which the test planner will 
need to give attention are given more complete consideration in chapter 11.

Scoring the Test 

Before test planning can progress very far, a decision has to be reached 
as to whether the scoring is to be subjective or objective. Since this book 
is addressed primarily to the question of developing objective tests, no 
further consideration will be given to subjective tests except to note that 
they are much more difficult to score and require even more careful plan- 
ning than objective tests in attempts to reduce scorer unreliability. 

One of the first questions regarding the scoring of objective tests that 
arises is whether the scoring is to be by hand or by machine. The answer
will depend partly on the availability of hand-scorers as against the avail- 
ability of scoring-machine facilities and partly on the predicted costs of the 
two types of scoring. It is clear that there is a limit to the number of tests 
to be scored below which machine-scoring is not economical. In other cases, 
where the number of tests to be scored is very large, it is probably a moot 
question whether machine-scoring is cheaper than hand-scoring, although 
many persons believe this is true. In some instances, good design of answer 
sheets and intensive training and high motivation of scoring clerks may 
make hand-scoring more economical than machine-scoring. For any sizable 
testing program, this question is worthy of careful scrutiny before a deci- 
sion is reached. 
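The break-even question raised here can be put in arithmetic form. The cost figures below are entirely hypothetical: hand-scoring is assumed to cost a fixed amount per test, while machine-scoring adds a fixed setup cost but a lower per-test cost.

```python
# Hypothetical scoring-cost comparison, in cents to avoid rounding.
# Machine-scoring is cheaper only when its setup cost is spread over
# enough tests: setup + m*n < h*n, i.e. n > setup / (h - m).

def hand_cost(n_tests, cents_per_test=5):
    return n_tests * cents_per_test

def machine_cost(n_tests, setup_cents=2500, cents_per_test=1):
    return setup_cents + n_tests * cents_per_test

def break_even(setup_cents=2500, hand_cents=5, machine_cents=1):
    return setup_cents / (hand_cents - machine_cents)

print(break_even())                          # 625.0 tests
print(hand_cost(1000) > machine_cost(1000))  # True
```

With these invented figures, hand-scoring is cheaper below 625 tests and machine-scoring cheaper above that point; the real decision, as the text notes, also turns on answer-sheet design and the training of scoring clerks.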

Occasionally a test constructor may believe that the type of content with 
which he is dealing does not lend itself readily to the multiple-choice form 
demanded for machine-scoring. In such a case it may be decided that the 
nature of the material rather than the possibility of machine-scoring should 
determine the item form to be selected. 

Whether a test is to be scored by machine or by hand, use of a separate 
answer sheet is usually advantageous from the standpoint of scoring econ- 
omy alone. Another factor that has bearing on the use of separate answer 
sheets is the type of test material or the nature of the ability being tested. 
If, for example, performance on a test is known to depend largely on 
perceptual speed, use of a separate answer sheet might introduce additional 
factors that would have unintended effects on test scores. 

If hand-scoring is to be used, the type of scoring key to be applied must 
be determined. There must also be decisions on whether right answers or 
wrong answers are to be marked or whether the scorer is simply to record 
a score without marking the answers. The stage in the scoring at which 
any weighting and combining of subtest scores are to be accomplished must
not be neglected. 
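The weighting and combining of subtest scores mentioned above amounts to forming a weighted composite. The scores and weights below are invented for illustration; in practice the weights would be fixed in the test plan.

```python
# Illustrative weighted composite of subtest scores.

def composite(scores, weights):
    """Weighted sum of subtest raw scores."""
    if len(scores) != len(weights):
        raise ValueError("one weight per subtest score is required")
    return sum(s * w for s, w in zip(scores, weights))

subtest_scores = [32, 18, 45]   # hypothetical raw scores on three subtests
weights = [2, 3, 1]             # hypothetical weights from the test plan
print(composite(subtest_scores, weights))   # 163
```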

The test planner, if he is also responsible for scoring a large number 
of tests, must plan how many scorers can be used most effectively, how they 
should be selected, and what training and supervision they will require. He 
must also plan how the scores are to be checked. The whole problem of 
scoring the test is thoroughly discussed in chapter 10, pages 365-413. 

Construction of Alternate Forms of a Test 

If it can be anticipated that several forms of a test are to be needed, 
plans for their construction should be made when the construction of the 
first form is being planned. 


The test planner must determine what definition of comparability he 
is going to adopt or what he is going to accept as the minimum essentials 
for comparability. He would ordinarily want to give attention to whether 
the two forms test the same functions and whether they yield score distri- 
butions of the same type and with the same central tendency and disper- 
sion (1). He would need to decide to what extent he could try out the test
in a preliminary edition to determine whether these conditions were met or 
what adjustments would be desirable. 
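The comparability conditions named here, score distributions with the same central tendency and dispersion, can at least be checked roughly from tryout data. The sketch below compares means and standard deviations of two forms; the scores and the tolerance are invented for illustration.

```python
# Rough check that two forms yield score distributions with similar
# central tendency (mean) and dispersion (standard deviation).
import statistics

def roughly_comparable(form_a_scores, form_b_scores, tolerance=1.0):
    mean_gap = abs(statistics.mean(form_a_scores) - statistics.mean(form_b_scores))
    sd_gap = abs(statistics.stdev(form_a_scores) - statistics.stdev(form_b_scores))
    return mean_gap <= tolerance and sd_gap <= tolerance

# Hypothetical tryout scores on two forms
form_a = [40, 45, 50, 55, 60]
form_b = [41, 46, 50, 54, 59]
print(roughly_comparable(form_a, form_b))    # True
```

A check of this kind tests only central tendency and dispersion; whether the two forms measure the same functions must still be judged from the content outline.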

Regardless of the opportunity for test tryout, it may be said that tests 
intended to be comparable should be based on the same outline of content. 
The more detailed the breakdown of subject matter in the outline, the more 
it facilitates construction of comparable forms. If the outline is so detailed 
that it indicates the content and behavior to be tested by each item and 
perhaps even the difficulty of the item, then two tests constructed on the 
basis of the outline will contain pairs of items that are more or less inter- 
changeable insofar as the judgment of the item constructor is concerned. 

Construction of comparable forms by attempting to construct pairs of 
items of identical difficulty and testing identical abilities raises the ques- 
tion of just how similar the items within each pair should be. Without an 
extended discussion of this question, the best answer that can be given is
that the members of the pair should not be so related that knowledge of 
the answer to one would immediately tell a person the answer to the other, 
but they should be so similar that in general a person able to answer one 
is likely to be able to answer the other. 

Before tryout, the decision on comparability calls for skilled judgment 
on the part of the test constructor. There is always the danger that the 
items so constructed will be so similar that exposure to the one will be of 
assistance in answering the other. This is not important if the same persons 
are not going to take both forms; but if this were not likely, there
would be little necessity for more than one form in the first place. 


This chapter is best ended with a restatement of its purpose. In a very 
real sense this entire book is a book on test planning. Many of the topics 
that have been considered in this chapter will be taken up again in much 
greater detail in later chapters. This chapter, then, has merely provided an
introduction to, or a preview of, what follows. It has been intended pri- 
marily to orient the reader, to help him better to view all phases of test 
construction in their proper relation to one another, to recognize the need 
for careful coordination of the many separate phases, and to appreciate 
the great importance of anticipating all subsequent steps in test construc- 
tion as each next step is taken. It may serve some of these purposes better 
if it is read both before and after the other chapters are read. 

Selected References 

1. Adkins, Dorothy C., et al. Construction and Analysis of Achievement Tests. Washington: Government Printing Office, pp. 202-6, 1947.

2. Ashford, T. A. "The College Chemistry Test in the Armed Forces Institute," Journal
of Chemical Education, 21: 386-92, 1944. 

3. Davis, Frederick B. "Fundamental Factors of Comprehension in Reading," Psycho- 
metrika, 9: 186, 1944. 

4. Flanagan, J. C. "A Proposed Procedure for Increasing the Efficiency of Objective 
Tests," Journal of Educational Psychology, 18: 17-21, 1937. 

5. Vaughn, K. W. "The Interpretation and Use of the Professional Aptitude Test," 
Graduate Record Office Bulletin, 1-16, 1947. 

7. Writing the Test Item 

By Robert L. Ebel 

State University of Iowa

Collaborators: Dorothy C. Adkins, University of North Carolina; 
Louise Witmer Cureton, University of Tennessee; Charlotte Croon
Davis, Cooperative Test Service (formerly); Frederick B. Davis, Hunter
College; W. W. Turnbull, Educational Testing Service 

Any test consists of a number of tasks to be performed by the 
examinee. Some of these tasks are scored as indivisible units. Others are 
subdivided for scoring purposes. 

An "item" may be defined as a scoring unit. An "exercise" may be de- 
fined as a collection of items that are structurally related. For example, a 
matching exercise may consist of five items. A reading passage and the items 
based upon it also constitute a test exercise. An "objective" item or exercise 
is one that can be scored by mechanical devices or by clerks who have no 
special competence in the field. 

The present discussion of item writing is directed primarily toward ob- 
jective items and exercises used in paper-and-pencil tests of educational 
achievement. However, many of the suggestions made in this chapter will 
apply to other types of paper-and-pencil tests. 

Item writing is an art. It requires an uncommon combination of special 
abilities. It is mastered only through extensive and critically supervised 
practice. It demands, and tends to develop, high standards of quality and a 
sense of pride in craftsmanship. 

Item writing is essentially creative. Each item as it is being written pre- 
sents new problems and new opportunities. Just as there can be no set 
formulas for producing a good story or a good painting, so there can be 
no set of rules that will guarantee the production of good test items. Prin- 
ciples can be established and suggestions offered, but it is the item writer's 
judgment in the application (and occasional disregard) of these principles 
and suggestions that determines whether good items or mediocre ones will 
be produced. 

Those who have not tried to write objective test items to meet exacting 
standards of quality sometimes fail to appreciate how difficult it is to write 
such items. The amount of time that competent item writers devote to the 
task provides one indication of its difficulty. Adkins has pointed out that 
experienced professional item writers regard an output of five to fifteen 
good achievement test items per day as a satisfactory performance.^ This 
contrasts sharply with the widely held notion that any good instructor can
produce an acceptable test in an evening or two. Further evidence concern- 
ing the difficulty of item writing is provided by the amount of money that 
critical test producers appropriate for the work. It is not uncommon for 
item writers to receive $2.00 or more for the production of a single good 
test item, or to be paid at the rate of from $2.00 to $4.00 per hour. The cost 
of a finally reviewed and approved test item may run from perhaps $3.00 
to $10.00, depending on the content involved and the care exercised. 

Extensive use of statistical methods for the analysis of responses to test 
items has seemed to imply that test production can be made a statistical 
science in which the skill of the item writer is of secondary importance. 
This notion is based on a misconception of the role of item analysis. Item 
analysis data do indeed often call attention to specific weaknesses within 
otherwise good items, and thus provide clues by which an ingenious item 
writer can make improvements. Such analyses may also make possible elim- 
ination of some of the weak items from a group, but, under usual cir- 
cumstances, the amount of improvement that can be effected by this 
process is slight. The appropriate role of item analysis is fully discussed in 
chapter 9. For the present it is sufficient to observe that the analysis of 
test items in no way lessens the necessity for skill and care in the original 
writing of them. 

Requirements for Writing Good Items 

The combination of abilities required for successful writing of educa- 
tional achievement test items can be specified easily in general terms. It is 
much more difficult actually to find persons who have these abilities. 

First, the item writer must have thorough mastery of the subject matter 
being tested. The term "mastery of subject matter," as here used, has 
broad connotations. Not only must the item writer be acquainted with the 
facts and principles of the field, but he must be fully aware of their im- 
plications, which is to say that he must understand them. He should be 
aware of popular fallacies and misconceptions in the field. This is particu- 
larly necessary in the construction of items that provide suggested re- 
sponses and therefore require incorrect responses which appear plausible 
to the poorer examinees. 

^ Of course, certain types of items are much easier to prepare than others. Ten ac- 
ceptable vocabulary items could probably be written in the time often required to produce 
one satisfactory item intended to measure a complex understanding. 


Sometimes test items are written on the basis of collaboration between 
test technicians and subject-matter experts. While this procedure is not 
ordinarily as productive of good items as is the work of a single person who 
fortunately combines test competence and subject-matter expertness, it 
yields far better items than would be produced by either specialist working 
alone. In practical test construction, collaboration of this sort is very helpful. 
The effectiveness of the collaboration depends not alone on the degree of 
competence of each specialist, but also upon the extent to which each 
shares a general background in the specialty of his partner. 

Second, and of utmost importance, the writer who prepares items for 
use in tests of educational achievement must possess a rational and well- 
developed set of educational values (aims or objectives) which are not 
mere pedagogical ornaments, but which so permeate and direct his thinking 
that he tends continually to seek these values in all his educational efforts. 
(See chapter 5, pages 121-41.) It is difficult, if not impossible, for one 
whose sense of values is inadequate or inoperative to produce good achieve- 
ment test items consistently. He may have at his disposal detailed syllabuses 
describing course content. He may be well acquainted with what goes on in 
typical classrooms where certain principles and abilities are being taught. 
But if he is not clearly aware of the educational values that should be direct- 
ing the teaching and learning, he is almost certain to emphasize the super- 
ficial at the expense of the essential. 

Third, the item writer must understand, psychologically and education- 
ally, the individuals for whom the test is intended. He must be familiar 
enough with their probable levels of educational development to adjust 
the complexity and difficulty of his items appropriately, and to know what 
will constitute plausible distracters for multiple-choice items. He must 
have enough insight into their probable mental processes when confronted 
with various types of questions to avoid ambiguity, irrelevant clues to 
correct responses, or the measurement of extraneous abilities. 

Fourth, the item writer must be a master of verbal communication. Not 
only must he know what words mean and insist on using them with precise 
meanings, but he must also be skilled in arranging them so that they com- 
municate the desired meaning as simply as possible. Always, he must be 
critically aware of various possible interpretations which the examinee may 
make of the words and phrases in the item. It is probably true that no 
sentences are read with more critical attention to meanings, expressed and 
implied, than those which constitute test items. 

Fifth, the item writer must be skilled in handling the special techniques 
of item writing. Obviously he needs to be familiar with the types and varie- 
ties of test items and with their possibilities and limitations. Obviously he
needs to know the general characteristics of good items and needs to be 
aware of the errors commonly made in item writing. But excellence in item 
writing demands more than this. It demands imagination and ingenuity in 
the invention of situations that require exactly the desired knowledge and 
abilities. It demands ability to identify the crucial element in each problem 
situation so that the corresponding item will be as direct and concise as 
possible. Most of all, it demands the skill and judgment that come only 
with experience. In test construction, as in other fields, the author must 
usually learn to write by writing. 

In consideration of these requirements, several things should be clear. 
The process of constructing good test items is not simple. Not all individ- 
uals are equipped to master it easily. The abilities needed are too deeply 
rooted and too slow of growth to be produced in a short period of time. 
Manuals and rules may provide useful guides and helpful suggestions for 
item writing, but there are no automatic processes for the construction of 
good test items. Even an item writer who possesses the needed abilities will 
find that his success varies with the amount of energy and time which he is 
willing to devote to the task. Recognition of the skill and pains which go 
into the production of a good test is a prerequisite to improvement in item writing.

The high standards for item writing implied by this discussion have 
not been met generally in the past. A very large number of educational 
tests have been produced by item writers who lack the qualifications sug- 
gested, and whose tests consequently fall short of the standards set. This 
situation is likely to continue in some degree for an indefinite future period. 
Meanwhile, in the interests of better testing, it is desirable that ideal 
standards as well as present shortcomings and obstacles be recognized. Con- 
sistent efforts must be made to overcome the faults and approach the ideals. 
Not every item writer can hope to possess ideal qualifications, but each can 
improve his tests by rational application of the specific suggestions offered 
in the remainder of the chapter. Furthermore, those who direct test con- 
struction agencies must recognize more fully the paramount importance of 
the item writer in the whole process, and must offer rewards and facilities 
that will attract to the task of item writing the very highest level of intellec- 
tual competence. 

Previous Studies of Item Writing 

The problems of item writing have not received the attention they de- 
serve in the literature on testing. There are abundant references on the 
controversial aspects of the testing movement, on the techniques of statisti- 
cal analysis of test data, and on the applications of test scores in guidance, 
placement, and evaluation. But the basic problem of writing good items 
has been neglected. An unpublished survey recently made by the writer 
of nearly one hundred and fifty periodical references on the construction 
and use of objective tests revealed only five articles which dealt directly 
with problems of item writing. 

Most textbooks on educational measurement show a similar lack of 
emphasis on item writing. Reading the chapter titles of these textbooks, 
one would never guess that a major problem, perhaps the major problem, 
facing the test specialist is that of writing good items. There are a few 
notable exceptions. The Manual of Examination Methods (8), produced 
by the University of Chicago Board of Examiners in 1933, contains some 
suggestions for item writing, although it is devoted mainly to illustrating 
various types of test items and the application of these types in various 
subject areas. In 1936 a committee sponsored by the American Council on 
Education produced a manual, The Construction and Use of Achievement
Examinations (3), which contains specific suggestions for the writing of
several widely used types of objective test items. More recently, Adkins and 
others prepared a volume on Construction and Analysis of Achievement 
Tests (1), which contains a section dealing with item writing. On the
whole, however, the amount of reflection, discussion, and publication di- 
rected to the problem of writing objective test items has fallen far short of 
the requirements of the task. 

Research on problems involved in item writing has likewise been scanty, 
particularly in recent years. This is due in part to recognition of the diffi- 
culties involved in such research. Many early studies were inconclusive or 
actually misleading. These faults have been recognized, but no generally 
satisfactory means of avoiding them has been discovered. It is extremely 
difficult even to identify the variables which affect the quality or functioning 
of a test item. It is even more difficult to control these variables in an ex- 
perimental situation. As previously pointed out, the sense of values, the 
skills, and the ingenuity of the item writer play a large role in determining 
the quality of a test item. These characteristics are so complex and so diverse 
in manifestation that they are not easily subject to experimental control. 

A further criticism of many early studies is that they were directed 
toward inconsequential problems. For example, one popular subject of re- 
search was the characteristic reliability and validity of various item forms. 
It is now clear that any such characteristic differences as may exist among 
item forms are of trivial consequence when compared with the extreme 
differences observed among items of the same form. 

Thus, it is that few of the suggestions made in this chapter will be sup- 
ported by concrete research findings. In the main, they represent distillations 
of the experience of those who have been successful in test construction. 
Such distillations of experience are the best available guides to effective 
item writing. It is to be hoped, however, that aggressive attempts will be 
made in the near future to verify at least some of the recommendations.

Ideas for Test Items 

The Nature of Item Ideas 

Every test item begins with an idea in the mind of the item writer. The 
production and selection of ideas upon which test items may be based is 
one of the most difficult problems facing the item writer. While the test 
plan (see chapter 6) outlines the areas to be covered by the test and indi- 
cates the relative emphasis each area should receive, it does not ordinarily 
specify the content and purpose of each individual test item. The item writer 
is given the responsibility of producing ideas and developing them as 
items that will satisfy the specifications of the test plan. The quality of
the test produced depends to a considerable extent upon the item writer's 
success in dealing with this problem. 

If expressed formally, the idea for a test item would consist of a state- 
ment identifying some knowledge, understanding, ability, or characteristic 
reaction of the examinee. The following are examples of item ideas. 

a) Knowledge of the steps in the enactment of a federal law. 

b) Understanding of the relation of tides to the position of the moon. 

c) Ability to add two terms to the first three of a geometrical progression. 

d) Reaction to picketing by members of a union. 

e) Ability to infer the meaning of a word from the context in which it is used.

In actual practice item ideas are seldom formally stated. Usually they 
exist only temporarily and with no verbal explicitness in the mind of the 
item writer. An idea occurs to him, he judges its probable contribution to 
the test, and, if this is satisfactory, he proceeds immediately to write it in 
approximately its final form. This procedure is quite adequate in practical 
test construction. There are few cases in which a useful purpose would 
be served by preliminary formal statement of the item idea. But in the 
present discussion it is useful to consider the essence of an item apart from 
the phrasing in which it is presented. 

The Sources of Item Ideas 
There is no automatic process for the production of item ideas. They 
must be invented or discovered, and in these processes chance ideas and 
inspirations are very important. It is possible, however, to stimulate these 
processes by appropriate material. One source of stimuli is provided by 
instructional materials. Textbooks, course outlines, statements of objectives, 
lists of essential principles or basic abilities or frequent errors or common 
misunderstandings, discussion questions, and even questions from other 
tests are likely to suggest useful item ideas. 

A second type of material which is likely to stimulate the production 
of item ideas is provided by the written work of the students themselves. 
Their expressions of ideas on issues and problems may reveal points of 
difficulty which can be made the basis of discriminating test items. A third 
source of item ideas is provided by job analysis. This procedure, borrowed 
from the constructors of selection tests for use in government, business, 
and industry, may also be applied in the construction of educational achieve- 
ment tests. In using it, the item writer asks, "What is an individual who is 
proficient in this area expected to be able to do?" or, "How will the individual
who is proficient in this area differ from one who lacks proficiency?"
The answers to these questions may suggest valuable ideas for test items. 

The difficulty of obtaining item ideas depends, of course, upon the nature 
of the items desired. If the purpose of a test is simply to determine whether 
or not the examinees possess certain information, the item writer needs only 
to consult various sources of this information and base items on some of 
the statements he finds there. The simplicity of this process is probably 
one of the chief reasons why educational achievement tests have been so 
often overloaded with informational items. If, on the other hand, a serious 
effort is being made to measure understanding, ability to interpret and 
evaluate materials, and similar characteristics of the examinee, the task of 
obtaining item ideas is much more difficult. In this case the item writer must 
acquire a thorough understanding of the subject and must work to invent 
appropriate novel situations. 

The Selection of Item Ideas 

The process of selecting item ideas goes on simultaneously with the 
process of inventing them. Skill in item writing depends not only upon 
prolific inventiveness but also upon discriminating judgment in the selec- 
tion. In selecting item ideas, the writer must consider their appropriate- 
ness, importance, and probable discriminating ability.

Obviously the item ideas should be appropriate, which means that they 
must be in keeping with the test plan. They should deal only with those
aspects of achievement that the test is intended to cover, but they should 
deal with all of those aspects. The number of items related to each aspect 
of achievement should reflect the allocation of emphasis specified in the
test plan. 

Item ideas should also be selected on the basis of importance. While 
the test plan usually indicates the general areas of achievement that are 
regarded as important, it is the responsibility of the item writer to select 
important specific aspects of achievement as item topics. 

The term "importance" as here used is almost synonymous with "use- 
fulness" in its broadest sense. The ability to add is important because it 
is so frequently useful. The ability to apply artificial respiration is im- 
portant because, once in a lifetime, it may be critically useful. The con- 
cepts of atomic structure are important because they are, within a limited 
area of human activity, fundamentally useful as a basis for understanding 
physical and chemical phenomena. 

In contrast to these fundamental, crucial, or frequently used aspects of 
achievement are the superficial details, the incidental observations, and the 
explanatory illustrations and comments that have no enduring significance. 
For example, it is relatively unimportant that the birth year of Woodrow 
Wilson coincided with the first Republican presidential campaign. Yet 
many items dealing with such trivia have found their way into tests of edu- 
cational achievement. 

Finally, in the selection of item ideas it is necessary to consider their 
probable ability to discriminate between those who possess and those who 
lack a given understanding or ability. The chief purpose of most tests of 
educational achievement^ is to rank examinees as accurately as possible in 
order of their attainments. Only those items that are answered correctly 
by the better-qualified examinees and missed by those who are not well 
qualified contribute to the effectiveness of such tests. While an item writer 
cannot be certain in advance how a given item will perform, his selection 
of item ideas should be guided by several general principles. Aspects of 
achievement that are thoroughly mastered by all examinees, or those that 
for various reasons are hardly mastered at all, are likely to provide few use- 
ful discriminations. Further, if the specific ability called for in an item is 
not highly related to general proficiency in the area covered by the test, the 
item will discriminate poorly. Of course, the discriminating ability of an 
item also depends upon how well it is written, but even the best of writing 
will not convert some ideas into discriminating test items. 

Inclusion of poorly discriminating items is occasionally justifiable because 
of the contribution which the item makes to the apparent validity of the 
test, or because of its probable influence on study and teaching. For ex- 

* The statements in this paragraph do not apply to diagnostic tests or to mastery tests. 


ample, the plans for tests in American history usually call for a few items 
on social history. Because of the widely prevalent emphasis on political and 
economic history in American schools, good students as a group know 
little more about social history than do their less-favored classmates. But 
even though these items fail to discriminate, they may serve a useful pur- 
pose. The stated objectives of most history courses include some mention 
of social history, so that a test which purports to cover the field adequately 
can hardly avoid including such items regardless of their effects upon the 
test statistics. Further, such items emphasize the importance of social 
history to students and teachers. In the areas where teaching practices are 
somewhat behind the recommendations of leaders in the field, tests can be 
of some help in leading the way toward improvement. 

The Forms of Test Items 

The form of an objective test item is determined by the arrangement of 
words, phrases, sentences, or symbols composing it, by the directions to 
the examinee for response to it, and by the provision made for recording 
the response. 

A wide variety of item forms have been suggested and used. One manual 
on test construction lists thirteen major forms with many variations of 
each. It will not be necessary to consider more than a few important types 
in this chapter. A few popular forms account for the bulk of all items
written. Further, most of the problems that arise in using the more com- 
mon forms also arise in using other forms, and the principles leading to 
successful use of the forms discussed here apply also to other forms. 

It is worth noting that all objective item forms may be divided into 
two main classes. On the one hand are items to which the pupil must 
respond by supplying the words, numbers, or other symbols which con- 
stitute the response. On the other hand are items to which the pupil re- 
sponds by selecting a response from among those presented in the item. 
Between the supply type and the selection type of items there are real dif- 
ferences, but these differences have frequently been misinterpreted so 
that various forms have been credited with merits and faults that they 
do not possess. 

The forms that will be described here include the short-answer, which 
represents the supply type, and the true-false, multiple-choice, and 
matching, which represent the selection type. The item writer needs to
be thoroughly familiar with each of these forms. But it is important 
to note that differences in form do not constitute the only or even the 
most significant differences among test items. It is now recognized that 


many earlier studies of the relative merits of different forms lack signifi- 
cance because they failed to consider characteristics and to control vari- 
ables that are more important than form. 

The Short-Answer Form 
The short-answer form is characterized by the presence of a blank 
on which the examinee writes the kind of answer called for by the 
directions. This form is no longer widely used aside from informal 
classroom testing, and even there it has tended to lose favor. It is dis- 
cussed here because misconceptions concerning its value and applica- 
bility still persist, and because it provides an opportunity to emphasize by 
contrast some of the advantages of other forms. Three varieties of the 
short-answer form may be illustrated. 

1. The question variety 

1. Who invented the cotton gin ? 

2. How many calories will be required to change eight grams of ice at 0° C. 
to steam at 100° C? 

2. The completion variety 

1. Snowbound was written by ______________.

2. The body of an insect is divided into three parts: ________,
________, and ________. Insects have ________ antennae and ________
legs. They breathe by means of ________.

3. The identification or association variety 

1. After each city write the state in which it is located.

Detroit ____________        New Orleans ____________
Chicago ____________        San Antonio ____________
Seattle ____________        Atlanta ____________

The True-False Form 
The true-false form consists of a statement to be judged true or false. 
It is essentially a two-response item in which only one of the possible 
alternatives is explicitly stated. Several other forms are closely enough 
related to the true-false form to justify considering them as simple 
varieties of the true-false form. These are the question which is to be 
answered yes or no, and the statement which is to be judged correct 
or incorrect. Illustrations of this form are given below. 

1. The true-false variety 

This variety consists of a declarative statement that is true or false. 

The pressure in a fixed quantity of a gas varies directly as its 

volume if the temperature remains constant. T F 


2. The right-wrong variety

This variety consists of a sentence, equation, or other expression that 
is to be marked right or wrong depending on whether it is correctly 
or incorrectly written. 

Glancing down the famous street, signs of every kind were 

visible. (Sentence structure) R W 

3. The yes-no variety 

This variety consists of a direct question that is to be answered yes 
or no. 

Does a ripsaw have larger teeth than a crosscut saw ? Y N 

4. The correction variety 

In this variety the examinee is directed to make every false statement
true by suggesting a substitute for the underlined word. This variety
combines selection of responses with the supplying of responses to some
extent.

The use of steam revolutionized transportation in the 17th
century. T F ____________
5. The cluster variety 

This variety consists of an incomplete stem with several suggested 
completions, each of which is to be judged true or false. 

The volume of a mass of gas

1. tends to increase as the temperature increases.                1. ____
2. tends to increase as the pressure increases.                   2. ____
3. may be held constant by increasing pressure and decreasing
   temperature.                                                   3. ____
4. may be reduced to zero by increasing the pressure and
   decreasing the temperature.                                    4. ____

The Multiple-Choice Form 

The multiple-choice form consists of the item stem (an introductory
question or incomplete statement) and two or more responses (the
suggested answers to the question, or completions of the statement). In
this discussion the correct response or responses will be called the
answer(s), and the incorrect responses, the distracters.

The multiple-choice form is by far the most popular one in current 
use. It is free from many of the weaknesses inherent in other forms. It is 
adaptable to a wide variety of item topics. While it has often been used 
to measure superficial verbal associations and insignificant factual details,


and while many examples of poor multiple-choice items can be found, 
it can also be used with great skill and effectiveness to measure complex 
abilities and fundamental understandings. 

Many variations of the multiple-choice form have appeared. It is con- 
venient to discuss these variations as if each constituted a distinct variety 
of the multiple-choice form. Actually, of course, these variations affect 
different characteristics of the item and may be combined in various 
ways. A single multiple-choice item may thus possess the characteristics 
attributed to several distinct varieties here. 

1. The best-answer variety 

This variety of the multiple-choice form consists of a stem followed 
by two or more suggested responses that are correct or appropriate in vary- 
ing degrees. The examinee is directed to select the best (most nearly 
correct) response. 

1. Which one of the following factors contributed most to continued inflation
in 1948?

a. Reduction in income taxes. 

b. Bumper crops. 

c. High rate of consumer-purchasing. 

d. Increased payments to veterans. 

2. What is the basic purpose of the Marshall Plan?

a. Military defense of Western Europe. 

b. Re-establishment of business and industry in Western Europe. 

c. Settlement of differences with Russia. 

d. Direct help to the hungry and homeless in Europe. 

2. The correct-answer variety 

This variety consists of an item stem followed by several responses, 
one or more of which is absolutely correct while the others are incorrect. 

3. Who invented the sewing machine?

a. Singer d. White 

b. Howe e. Fulton 

c. Whitney 

The difference between the best-answer and the correct-answer variety 
is more one of topic than of form. The name of the inventor of the 
sewing machine is recorded in history beyond question or doubt. The 
causes of inflation or the purpose of the Marshall Plan cannot be stated 
with any such precision. 


3. The multiple-response variety

When the item writer is dealing with questions to which a number 
of clearly correct answers exist, it is sometimes desirable to include two 
or more correct answers in the choices offered. When this is done, and 
the examinee is instructed to mark all correct responses, the item variety 
is designated as the "multiple-response" form. 

4. What factor or factors are principally responsible for the clotting of 
the blood ? 

a. Oxidation of hemoglobin. 

b. Contact of blood with injured tissue. 

c. Presence of unchanged prothrombin. 

d. Contact of blood with a foreign surface. 

A multiple-response item must also be a correct-answer item. It can 
never be a best-answer item. 

In practice the multiple-response item is not essentially different from 
the cluster variety of true-false item. This is particularly true if each 
response is scored as a separate unit, so that the examinee receives one 
point for each response he does mark that should be marked, and one 
point for each response he does not mark that should not be marked.

It is, of course, possible to score multiple-response items on an "all-or- 
none" basis. In this method of scoring, the examinee does not receive credit 
for an item unless he marks all the correct responses and only those. 
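The two scoring methods just described can be sketched in code. This is
an illustrative sketch only; the function names and the sample item key
are invented for the example, and responses are assumed to be identified
by letter.

```python
def score_per_response(marked, key):
    """Per-response scoring: one point for each response whose
    marked/unmarked status agrees with the key."""
    return sum(1 for r in key if (r in marked) == key[r])

def score_all_or_none(marked, key):
    """All-or-none scoring: credit only if exactly the correct
    responses, and no others, are marked."""
    correct = {r for r, should_mark in key.items() if should_mark}
    return 1 if marked == correct else 0

# key: response letter -> should it be marked? (hypothetical item)
key = {"a": False, "b": True, "c": False, "d": True}
print(score_per_response({"b", "d"}, key))  # 4: all four judgments agree
print(score_per_response({"b"}, key))       # 3: one correct response left unmarked
print(score_all_or_none({"b"}, key))        # 0: credit requires b and d, nothing else
```

Scored per response, the item behaves exactly like a cluster of true-false
items, which is the point made in the text above.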

4. The incomplete-statement variety 

Quite frequently the introductory portion of a multiple-choice item 
(the item stem) consists of a portion of a statement rather than a 
direct question. 

5. Millions of dollars worth of corn, oats, wheat, and rye are destroyed 
annually in the United States by 

a. rust 

b. mildews 

c. smuts 

d. molds 

5. The negative variety 

To handle questions that would normally have several equally good 
answers, item writers sometimes use a negative approach. The responses 
include several correct answers together with one that is not correct or 


which is definitely weaker than the others. The examinee is then instructed 
to mark the response that does not correctly answer the original question, 
or that provides the least satisfactory answer. 

6. Which of these is not true of a virus ? 

a. It can live only in plant and animal cells. 

b. It can reproduce itself. 

c. It is composed of very large living cells. 

d. It can cause disease. 

6. The substitution variety 

The multiple-choice form has been utilized by item writers in test- 
ing a student's ability to express himself correctly and effectively. Samples 
of originally well-written prose or poetry are systematically altered to 
include errors in punctuation, spelling, word usage, and similar con- 
ventions. Selected words or phrases in these rewritten passages are under- 
lined and identified by number. Several possible substitutions for each 
critical phrase are provided. The examinee is directed to select the phrase 
(original or alternative) that provides the best expression. 

7. Selection items

Surely the forces of education should be fully utilized to acquaint youth
with the real nature of the dangers to democracy, for (7) no other place
offers as good or better opportunities than (8) the school for a rational
(9) consideration of the problems involved.

7. a. , for
   b. . For
   c. — for
   d. no punctuation

8. a. as good or better opportunities than
   b. as good opportunities or better than
   c. as good opportunities as or better than
   d. better opportunities than

9. a. rational
   b. radical
   c. reasonable
   d. realistic

7. The incomplete-alternatives variety 

In some cases, an item writer may feel that the suggestion of a correct 
response would make the answer so obvious that the item would func- 
tion poorly or not at all. He may then resort to incomplete or coded 
alternatives. For example, the examinee may be asked to think of a
one-word response and to indicate that response on the basis of its first
letter.

8. Thomas Aquinas was an important figure in the development of which 
school of philosophy ? 

1. a to e 4. p to t 

2. f to j 5. u to z

3. k to o 

Since the correct answer to this item is "scholasticism," he should mark 
response 4. 

The use of incomplete responses makes possible the objective measure- 
ment of such traits as active vocabulary. In tests for this purpose it is 
essential to force the examinee to think of the appropriate response 
himself. The following item illustrates this application. 

9. An apple that has a sharp, pungent, but not disagreeably sour or
bitter, taste is said to be 4* ______.

1. R
2. S
3. T
4. U
5. V

* The figure indicates the number of letters in the word (in this case
"tart"). This restriction serves to rule out many borderline correct
responses.

Incomplete responses may also be used in arithmetical problems. The 
student may be directed to mark a choice on the basis of a certain 
digit in his answer, such as the third digit from the left. The use of 
incomplete responses for arithmetic problems prevents a student from 
using the responses as starting points for reverse, short-cut solutions of 
the problems. 

The incomplete-response variety represents a hybrid between the short- 
answer and multiple-choice form. It has the advantage of perfectly ob- 
jective scoring. However, like the short-answer form, it is limited to ques- 
tions for which unique simple correct answers exist. Further, unless the 
response categories are sharply delimited, credit may be given for wrong 
answers that happen to fall in the correct response category. 
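The letter-range coding shown in example 8 can be expressed mechanically.
A minimal sketch: the range boundaries (a-e, f-j, k-o, p-t, u-z) come
from the example above, while the function name is invented for
illustration.

```python
def code_for_word(word, bins=("ae", "fj", "ko", "pt", "uz")):
    """Return the 1-based code number for a word, judged by the
    alphabet range containing its first letter."""
    first = word[0].lower()
    for i, (lo, hi) in enumerate(bins, start=1):
        if lo <= first <= hi:
            return i
    raise ValueError("answer must begin with a letter")

print(code_for_word("scholasticism"))  # 4: 's' falls in the p-t range
print(code_for_word("tart"))           # 4: 't' also falls in p-t
```

The last caution in the text is visible here: any wrong answer that
happens to begin with a letter in the keyed range would also receive
credit.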

8. The comhined-response variety 

This variety consists of an item stem followed by several responses, 
one or more of which may be correct. A second set of code letters indi- 
cates various possible combinations of correct responses. The examinee 
is directed to choose the set of code letters which designates the correct 
responses and to mark his answer sheet accordingly. The following is 
an example of the combined-response variety. 

1. Our present constitution 

a. was the outgrowth of a previous failure. 


b. was drafted in Philadelphia during the summer (May to September) 
of 1787. 

c. was submitted by the Congress to the states for adoption. 

d. was adopted by the required number of states and put into effect in
1789.

1. a 

2. a, b 

3. a, b, c 

4. b, c, d

5. a, b, c, d 

This variety represents essentially a different method of scoring the 
multiple-response variety. It is limited to questions for which definitely 
correct or incorrect responses exist. There is no reason to believe that 
combined-response items are easier to score or yield more valid scores 
than straight multiple-response items. 

The Matching Form 

The matching form of objective test exercise consists of a list of 
premises, a list of responses, and directions for matching one of the 
responses to each of the premises. Names, dates, terms, phrases, state- 
ments, portions of a diagram, and many other things are used as premises. 
A similar variety of things may be used for responses. The distinction 
between premise and response is purely formal. In the present discus- 
sion the premises will be identified as those bearing the item number. 

Two chief varieties of the matching form are in common use. One is 
the simple matching exercise illustrated by the following example. 

On the blank before each of the following scientific achievements, place the 
letter that precedes the name of the scientist responsible for it. 

____  9. Demonstrated the circulation of the blood        a. Louis Pasteur
____ 10. Demonstrated the statistical approach to         b. Gregor Mendel
         human heredity                                   c. Francis Galton
____ 11. Conducted crucial experiments on the             d. Robert Koch
         mechanism of heredity                            e. William Harvey

In items of this type the basis for matching is almost self-evident. The 
simple matching exercise is chiefly useful for identification of names, 
dates, structures, and similar associations. 

In the exercise above some of the responses do not match any of 
the premises. Matching exercises having this characteristic, which is often 
termed "imperfect matching," are widely used. They do not permit the 
examinee to determine the correct response to the "last" premise by
elimination.

Exercises in which each response matches one and only one premise 
are termed "perfect matching" exercises. These have a serious limitation 
in that not all of the premises can function as independent items. In a 
four-premise exercise, correct response to three of the premises guarantees 
correct response to the fourth. Perhaps the best type of matching exercise 
is that in which a response may match one, more than one, or none of the
premises.
The other chief variety of the matching form is based on the classifica- 
tion of statements. 

Directions: In the following items you are to judge the effects of a particular 
policy on the distribution of income. In each case assume there are no 
changes in policy that would counteract the effect of the policy described 
in the item. For each item blacken answer space 

a. if the policy described would tend to reduce the existing degree of in- 
equality in the distribution of income; 

b. if the policy described would tend to increase the existing degree of in- 
equality in the distribution of income; 

c. if the policy described would have no effect, or an indeterminate effect, 
on the distribution of income. 

33. Increasingly progressive income taxes. 

34. Confiscation of rent on unimproved urban land. 

35. Introduction of a national sales tax. 

36. etc. 

This variety is well adapted to item topics dealing with explanations, 
criticisms, and other higher-level learning products. 

The Pictorial Form 

Test items based on pictures and graphical representations have not 
been widely used. This may be due to the fact that many item writers 
are not artists themselves and do not have the assistance of capable 
artists. It may also be due to their failure to recognize the advantages 
which often accrue from the use of pictures and diagrams. Picture 
test items have been used extensively in the training programs of the 
armed forces. The illustrative items shown in Figures 2, 3, and 4 have 
been taken from a bulletin, How To Make the Picture Test Item (4),
published by the Department of the Army. This bulletin should be con- 
sulted for further suggestions on the use of items in this form. 

Pictorial items are of two general types — those in which the picture 
itself presents a problem in interpretation, and those in which the pic- 
ture serves as an effective means of communicating ideas. Figure 2 



illustrates the first type. Figures 3 and 4 illustrate the second. The 
pictorial material may consist of a photograph, a drawing, a diagram, 
a graph, or a map. Photographs or drawings usually picture objects, 
situations, or operations. 

Good pictorial items of the interpretive type can be justified as direct 
measures of useful skills. Those which use pictures as means of com- 
munication can be justified on other grounds. In the first place, a pic- 
ture often provides a more direct and easily understandable expression 





What is the reading on the vernier
shown in the drawing at the right?

A) 10.03
B) 10.13
C) 10.14
D) 11.04

Fig. 2. — Picture items may measure ability to use instruments, read maps, etc. (From
Department of the Army bulletin [4]. Reproduced by permission of the Secretary of the
Army.)

than is possible using words alone. In the second, the use of pictures 
tends to reduce the typically heavy burden of reading comprehension, 
which often tends to prevent the measurement of nonverbal abilities. The 
argument that pictures make a test more interesting or attractive has 
been suggested, but probably merits less weight than the others. While 
the use of pictures has been generally neglected in test construction, a 
few enthusiasts have overused them. If a problem can be clearly expressed 
in a few simple words, it is not ordinarily improved by the inclusion 
of a picture. 

Some words are involved in almost every pictorial item. The sug- 
gestions made later for the construction of written items in various forms 



21. The board Y is clamped to the
board X in order to

A) prevent board X from splintering when boring
B) make the boring easier
C) act as a stop for the bit
D) make withdrawal of the bit easier

Fig. 3. — Picture test item containing groups of objects to show relationship. (From
Department of the Army bulletin [4]. Reproduced by permission of the Secretary of the
Army.)


Which is the best method of suspending a
splice in spiral four cable?

Fig. 4. — Picture test item showing right and wrong procedures. (From Department of
the Army bulletin [4]. Reproduced by permission of the Secretary of the Army.)


will apply in general to items based on pictures. The one additional 
factor which needs to be considered is the quality and appropriateness of 
the pictorial material used. Unless skillfully drawn so that the objects 
or situations represented are easily recognized and the essential ele- 
ments receive proper emphasis, the drawing may defeat its own purpose. 

Characteristics of Objective Test Items

Validity
The validity of an item for a given purpose depends considerably upon 
the idea it involves and upon the skill with which it is constructed. 
There is, however, a persistent belief that certain item forms are in- 
trinsically more valid than others. Short-answer items are often credited 
with high validity because they require the examinee to supply an answer 
of his own. Those who hold this point of view argue that the ability to 
supply an answer is more useful and indicates higher achievement than 
the ability to select one from suggested alternatives. Both of these argu- 
ments need examination. 

In the first place, the arguments appear strongest when one has in 
mind an item intended to test memory for specific facts. A short-answer 
item involving spelling or addition is clearly a more direct and, hence, 
intrinsically more valid measure of achievement in those areas than a 
choice-type item is. On the other hand, items in short-answer form have 
no such apparent advantage over choice-type items for measuring judg- 
ment or reasoning ability. In practice it has been found much easier 
to differentiate good from poor reasoners on the basis of their choice 
of conclusions than on the basis of their written statements of those
conclusions.
In the second place, choice-type items only appear to give the examinee 
too much help. When critics ask how an examinee can fail to get the cor- 
rect answer when it is printed in plain sight, the experienced test con- 
structor answers that many examinees do actually fail to select the correct 
response. Item writers can prepare fair and clearly stated choice-type items 
in most areas that are difficult enough to discriminate between those who 
have and those who lack a given aspect of achievement. Contrary to 
naive expectation, many items become more, instead of less, difficult 
when transformed from short-answer to multiple-choice form. A generally 
correct notion that would be quite adequate for response to a short- 
answer item will not be adequate for response to a choice-type item that 


requires the examinee to make precise discriminations among several 
plausible alternatives. The following item illustrates this point. 

What causes the diaphragm of a telephone receiver to vibrate?

1. Changes in voltage caused by a transformer. 

2. Changes of current strength in an electromagnet. 

3. Changes in position of carbon granules pressing against the diaphragm. 

4. Changes in the position of the condenser. 

When this question is used alone as a short-answer item, it requires only 
such simple general answers as "electrical currents," "magnetism," or 
"an electromagnet." When transformed into multiple-choice form, much 
more detailed understanding is required. 

In the third place, while a person who can on demand produce an 
answer demonstrates higher achievement of a particular type than one who 
cannot, it by no means follows that ability to produce answers is generally 
a more useful type of achievement than ability to make right choices. 
In fact, the reverse is probably true. Success in many occupations and
activities is more closely related to the ability to make a wise choice
among known alternatives than to the ability to produce new ideas.

Finally, whatever case can be made for short-answer items on other 
grounds is weakened seriously by the ambiguity inherent in them, and 
by the consequent difficulty of scoring them fairly. These points are dis- 
cussed on page 207. 

Errors Due to Guessing 

Many users of objective achievement tests suppose that test scores de- 
rived from choice-type items are subject to serious errors due to guess- 
ing. Some also believe that the chief weakness of the true-false item is the 
frequency with which poorly prepared examinees make high scores by 
"coin tossing." It can hardly be denied that some chance response to 
choice-type items is possible, but the seriousness of the problem has been 
exaggerated. In this section only those aspects of guessing which are 
related to the forms of objective test items will be considered. The chapter 
on test administration and test scoring (pages 365-69) will discuss the 
matter in greater detail. 

In considering this problem, it is helpful to distinguish three bases 
for response to test items: certain knowledge or misinformation, informed 
guesses, and blind chance. Responses based on certain knowledge intro- 
duce no error to test scores. Responses based on blind chance contribute 
only error. The amount of error introduced by guesses depends upon the 


amount of information on which they were based. Guesses based on almost 
no information are practically equivalent to chance responses. Those 
based on almost complete information may be practically the same as cer- 
tain responses. It is a tenable hypothesis that all but the wildest guesses 
contribute more to the true variance of test scores than to their error 
variance, and hence constitute useful responses from the standpoint of 
measurement. Under this hypothesis, the only harmful "guesses" are those 
based on almost no information, which are not really guesses at all but 
chance responses. 

The amount of error introduced by chance responses depends in part 
upon the frequency of such responses. The conditions under which re- 
sponse by chance is attractive to the typical examinee need not occur 
commonly. An examinee who is interested in making the highest possible 
score and who has sufficient time seldom prefers a blind response to one 
based on consideration of all available information. Both motivation and 
time limits are at least partially under the control of the test adminis- 
trator. If they are handled properly, the frequency of chance response 
and, hence, the amount of error introduced by those responses can be 
made quite small. 

While the probability of a correct response to a single true-false item 
by chance is twice that of the corresponding probability for a four-choice 
item, it by no means follows that scores on a true-false test involve twice 
as much error as those from four-choice items. In fact, if chance responses 
are given to the same number of true-false and four-choice items, the error 
variance due to chance response is only one-third larger in the case of the 
true-false items than in the case of the four-choice items. Further, the 
effect of this error variance due to chance on the reliability of the test 
scores depends on the total variance of the obtained scores and on the 
potency of other sources of error in the scores. If empirical evidence of 
these factors is lacking, it is hardly justifiable to assume that true-false 
items are inevitably subject to much greater error through chance re- 
sponse than are four-choice items. 
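The "one-third larger" figure follows from the binomial variance of
blind guessing: a chance response to a k-choice item is correct with
probability 1/k, so each such item contributes p(1 - p) to the variance
of a number-right score. A quick check (the function name is invented
for illustration):

```python
def chance_variance(choices, n_items=1):
    """Variance of the number-right score when n_items items, each
    with the given number of choices, are all answered by blind
    chance (binomial variance: n * p * (1 - p))."""
    p = 1.0 / choices
    return n_items * p * (1 - p)

true_false = chance_variance(2)   # 0.5 * 0.5  = 0.25 per item
four_choice = chance_variance(4)  # 0.25 * 0.75 = 0.1875 per item
print(true_false / four_choice)   # ratio 4/3: one-third larger, not double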

Many test users do not recognize that the possibility of obtaining a 
high relative score on a well-constructed true-false test purely by chance 
is very small. Suppose (quite unrealistically) that 100 examinees respond 
blindly to each item of a 200-item true-false test, and that each examinee's 
score consists of the number of items he answers correctly. The expected 
mean of these chance scores would be 100. Assuming the distribution 
follows the binomial law, none of the examinees would be expected to 


receive scores of 120 or higher. In the long run chance scores larger than 
119 would occur less than three-tenths of 1 percent of the time. By giving 
due attention to the difficulty of the items, the test constructor can pro- 
duce a test yielding a distribution of scores whose lower limit would be 
above 120. Under these conditions the frequency with which an examinee 
could make a "good" score by "coin tossing" would be too small to con- 
cern the test constructor. 
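The figures in this example can be verified against the exact binomial
distribution; the sketch below uses only the standard library, and the
function name is invented for illustration.

```python
from math import comb

def prob_score_at_least(n_items, p, threshold):
    """Exact binomial tail: probability of obtaining threshold or
    more correct answers among n_items blind guesses, each correct
    with probability p."""
    return sum(comb(n_items, k) * p**k * (1 - p)**(n_items - k)
               for k in range(threshold, n_items + 1))

# 200 true-false items answered blindly: expected score is 200 * 0.5 = 100.
tail = prob_score_at_least(200, 0.5, 120)
print(tail)  # under 0.003: chance scores of 120 or more are very rare
```

This confirms the claim in the text that scores above 119 would occur by
chance less than three-tenths of 1 percent of the time.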

Applicability of Various Forms

Short-answer items
Experience has shown that it is nearly impossible to phrase short-answer 
items on certain essential topics so that the same correct responses will be 
made by all those who know the answers. Item writers are often sur- 
prised by the variety of correct responses which appear to questions for 
which they had conceived only one answer. Most words and phrases 
have many synonyms and near-equivalents. Also, it is often possible to 
fill the blank of a short-answer item with words which are appropriate, 
but which are remote from the intent of the item writer. In the item 

Columbus discovered America in 

the examinee may appropriately respond with either "1492" or "the
Santa Maria."
determinable answers cause trouble unless explicit directions have been 
given concerning the form and precision required in the answer. 

As creditable responses to a test item multiply, the scoring key becomes 
increasingly cumbersome, the scoring procedure more time-consuming, and 
the obtained scores less reliable. Further, it becomes inadvisable to en- 
trust the scoring to clerks who lack mastery of the subject matter in- 
volved. Practically speaking, only items involving simple computations 
or simple verbal associations can be handled with satisfaction in the 
short-answer form. These difficulties, of course, are less serious in class- 
room testing where the tests are scored by a competent teacher and where 
speed of scoring may be a secondary consideration. 

True-false items 

True-false items should be based only on statements that are absolutely 
and unambiguously true or false. A relatively small proportion of the 
significant statements that can be made on any subject satisfy this criterion. 
To meet the standard of absolute truth, a statement must be so precise 


in phrasing and so universal in application that it requires no additional 
qualifications and admits of no possible exceptions. 

If statements that are only approximately true are presented as items 
in a true-false test, they pose a difficult and unreasonable problem to 
the examinee. Not only must he know to what extent the statement is 
true, but he must also guess what degree of untruth will be tolerated 
by the scorer. 

Consider the problem of response to these hypothetical true-false statements: 

1. The numerical value of pi is 3. 

2. The numerical value of pi is 3.1416. 

3. The numerical value of pi, correct to four decimal places, is 3.1416. 

If presented in separate tests, it is safe to guess that most informed 
examinees would mark item 1 false and item 2 true. Yet both items 
are basically alike in being approximations. Each is true as far as it goes. 
Only by further qualification, as in item 3, can the item be made ab- 
solutely true. 

The difficulty involved in securing absolutely true statements for use 
in true-false items may be illustrated by this example. 

4. Calcium chloride attracts a film of moisture to its surface and gradually 
goes into solution. 

This item is true if and only if solid calcium chloride is in an atmos- 
phere containing moisture. 

The following item is also questionable. 

5. No satisfactory explanation has ever been given for the migration of birds. 

Many explanations have been offered for the migration of birds. It is 
conceivable that one of them will ultimately prove correct. Further, the 
item does not specify who must be "satisfied" by the explanation. 
Ambiguity is also present in this statement: 

6. The nourishment assimilated by the body depends upon the amount of 
food eaten. 

No one would argue that there is a perfect relationship between assimila- 
tion and intake over all possible values of intake. It would be equally 
absurd to claim that there is no relationship. 

Although changes in wording would improve some of the foregoing 
items, the basic difficulty is not one of clear expression. Nor can the 


difficulty be avoided (although it may be reduced somewhat with some 
items) by the use of qualified response categories such as, "completely 
true, mostly true, mostly false, completely false." Instead, the ambiguity is 
an inevitable result of the attempt to apply an abstract standard of ab- 
solute truth to statements whose truth is relative, conditional, or approximate. 

This requirement of absolute correctness tends to limit the applica- 
bility and the validity of items in true-false form. Many important out- 
comes of instruction are generalizations, explanations, predictions, evalu- 
ations, inferences, and characterizations. Since these things often cannot 
be expressed in statements which are precisely and universally true, they 
cannot be tested effectively by true-false items. Analyses of responses 
reveal that it often is the brighter, better-informed students who sense 
the need for qualifications and the possibility of exceptions to statements 
presented as true-false items. When this happens, the item loses validity, 
and may even have negative validity. 

Recognition of this limitation has led some item-writers to use the 
true-false form only to test memory for simple facts. This is a needless 
restriction. With certain topics, well-constructed true-false items can stim- 
ulate fairly complex reasoning processes. Consider, for example, these 
statements: 

1. If a square and an equilateral triangle are inscribed in the same circle, 
the side of the square is longer than the side of the triangle. 

2. It is possible for an erect man to see his entire image in a vertical plane 
mirror one half as tall as he is. 
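Both statements can be settled by elementary reasoning. For the first, with a circle of radius $r$, the inscribed square and inscribed equilateral triangle have sides

```latex
\begin{aligned}
s_{\text{square}}   &= r\sqrt{2} \approx 1.414\,r \\
s_{\text{triangle}} &= r\sqrt{3} \approx 1.732\,r
\end{aligned}
```

so the triangle's side is the longer and statement 1 is false. For the second, tracing rays shows that a plane mirror one half the observer's height always suffices for a full image, so statement 2 is true.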

An ingenious item writer can borrow or invent many similar situations. 
When correctly understood, they lead to an unequivocal response of "true" 
or "false." If the examinee has not encountered the same situation be- 
fore, his correct response indicates ability to handle a complex reasoning process. 

The multiple-choice form 

The multiple-choice form is widely applicable. An item presented in 
short-answer, true-false, or matching form can always be converted into 
one or more multiple-choice items. Frequently this conversion improves 
the effectiveness of the item. Since there is only one correct response 
which an examinee can make to a multiple-choice item, the difficulty and 
subjectivity of scoring which plague the short-answer form are avoided. 
Since the multiple-choice form is adapted to the best-answer approach, it 


avoids the ambiguity associated with the application of a standard of ab- 
solute truth, which constitutes the chief weakness of the true-false form. 
Since each multiple-choice item may be independent, the problem of 
finding a number of parallel relationships, which frequently causes diffi- 
culty in the matching form, may be avoided. 

The matching form 

The matching exercise is poorly adapted to unique topics or test situa- 
tions. Since each of the responses should have some plausible relationship 
to each of the premises, both responses and premises must be relatively 
homogeneous. But many significant topics are unique and cannot be con- 
veniently grouped in homogeneous matching clusters. Consider, for ex- 
ample, the difficulty of incorporating items like the following, all of 
which deal with prices, in any homogeneous set of premises and responses. 

1. Why did the prices of radios and tires decline during 1948 at a time 
when the prices of other consumer goods were still rising? 

a. Costs of manufacture declined. 

b. There is normally more competition in these industries. 

c. There was organized consumer resistance to high prices of these commodities. 

d. The backlog of consumer demand disappeared. 

2. Buyers' strikes against higher prices have demonstrated 

a. their effectiveness in the case of foods but not of other products. 

b. their effectiveness in rural areas but not in cities. 

c. their relative ineffectiveness in reducing prices. 

d. their effectiveness in reducing wholesale but not retail prices. 

3. Which one of the following factors contributed most to continued infla- 
tion in 1948? 

a. Reduction in income taxes. 

b. Bumper crops. 

c. High rate of consumer purchasing. 

d. Increased payments to veterans. 

It may also be pointed out that use of the matching form may exert 
an undesirable influence on the distribution of emphasis in the test. 
An item writer may have in mind a particular date-event relationship 
that is of considerable importance. It occurs to him that there are similar 
date-event relationships that, though not of similar importance, might 
be included in a single matching exercise. As a result, the test finally 
produced may have excessive emphasis in this area, while other important 


aspects of achievement, which do not lend themselves well to grouping 
in a matching test, may be neglected. 

Ease of Construction 

The difficulty of constructing an objective test item depends far more 
on the level of quality demanded than on the form in which the item 
is cast. When faulty items are accepted uncritically, the construction of 
items in any form appears quite easy. Further, the difficulty of construct- 
ing an item depends on the topic with which it deals. For example, items 
measuring memory for facts are far easier to construct than items measur- 
ing understanding. 

Popular belief, borne out by casual experience, is that short-answer and 
true-false items are easier to construct than multiple-choice items. Short- 
answer items resemble the informal questions with which the teacher 
is familiar. True-false items resemble the declarative statements which 
abound in textbooks. It thus appears quite a simple matter to build ob- 
jective short-answer and true-false items on the basis of these handy 
materials. Actually the borrowing of such questions and statements is 
likely to yield unscorable or ambiguous items. 

The chief problem in the construction of short-answer items is to find 
questions each of which has a single correct answer. The search for such 
questions too often leads the item writer to concentrate on verbal asso- 
ciations and factual details and to avoid interpretations and other com- 
plex relationships. The practice of producing one variety of short-answer 
item by removing words from selected statements often yields vague, 
multiple-answer items. As an extreme example, consider this statement 
and the short-answer items derived from it. 

1. Tobacco is grown in the South. 

2. __________ is grown in the South. 

3. Tobacco is grown in the __________. 

While the original statement was perfectly correct, it is obvious that 
the items derived from it can be completed correctly in many dif- 
ferent ways. 

The chief problem in the construction of true-false items is to limit 
the ideas to those that can be expressed as absolutely true or false state- 
ments. A closely related problem is to word the statement so that it is 
unambiguous without being obviously true or false to the uninformed. 

The chief problem in the construction of multiple-choice exercises is 
to express a question or problem clearly in the item stem, to phrase 
the correct response defensibly, and to find attractive distracters which 


will permit the item to discriminate between those who have and those 
who lack the achievement involved. 

The chief problem in the construction of matching exercises is to find 
homogeneous premises and homogeneous responses for which a mean- 
ingful basis for matching exists. 

Ease of Scoring 

One of the chief reasons for the wide acceptance of objective test forms 
is the ease with which they can be scored. Choice-type items are much 
easier to score than short-answer items. Responses to true-false, multiple- 
choice, or matching items can be indicated by marking various positions 
on a separate answer sheet. These answer sheets can then be scored rapidly 
by clerical inspection if stencil keys are used or by electrical machines 
if responses have been marked with special pencils. 
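The stencil-key procedure amounts to a position-by-position comparison of the marked responses with a key; a minimal sketch (the answer letters and test length are hypothetical):

```python
def score_sheet(marked, key):
    """Number-right score: count the positions where the marked response
    matches the keyed answer, as a stencil or scoring machine would."""
    return sum(m == k for m, k in zip(marked, key))

key    = ["a", "c", "b", "d", "a"]   # hypothetical 5-item key
marked = ["a", "c", "d", "d", "b"]   # one examinee's answer sheet
print(score_sheet(marked, key))      # 3
```

The speed of the method comes from this mechanical comparison requiring no subject-matter judgment from the scorer.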

The efficiency, and consequent widespread use, of the electrical test- 
scoring machine, or of stencil keys for clerical scoring, has tended to 
favor certain item forms at the expense of others. Multiple-choice, true- 
false, and to a lesser extent matching exercises are well adapted to this 
mechanical type of scoring. Some writers have viewed this development 
with alarm, fearing that useful old forms may be abandoned and use- 
ful new item forms may not be developed because they do not fit mechani- 
cal-scoring requirements. Whether or not these fears are justified is diffi- 
cult to say. One may take the position that more is likely to be gained 
in educational measurement through improved application of existing 
item forms than through the development of new forms, and that, except 
in rare instances, the unadaptability of certain item forms to mechanical 
scoring represents no real loss to educational measurement. On the other 
hand, one should certainly not ignore the possibilities of any new evalu- 
ation technique solely because it does not appear suitable for mechani- 
cal scoring. 

The use of mechanical devices as aids in test scoring has been a 
cause of both amazement and critical comment by individuals not directly 
concerned with measurement. They have asserted that a mechanical device 
can be useful only in evaluating mechanical performances. If the test is 
poorly constructed this assertion may be true — but no more so than if other 
scoring methods were used. Competent writers can produce items which 
require exceedingly complex and original thought processes whose out- 
come can be recorded in a simple mechanical fashion. The simplicity of 
the record of the outcome is not a necessary indication of simple procedure 
in arriving at the answer. 


Suggestions for Item Writing 

General Suggestions 

1. Express the item as clearly as possible. 

The production of good test items is one of the most exacting tasks 
in the field of creative writing. Few other words are read with such 
critical attention to implied and expressed meaning as those used in 
test items. The problem of ambiguity in objective test items is particu- 
larly acute because each item is usually an isolated unit. Unlike ordinary 
reading material, in which extensive context helps to clarify the mean- 
ing of any particular phrase, the objective test item must be explicitly 
clear in and of itself. The power of an item to discriminate between the 
competent and incompetent may be seriously limited by lack of clarity. 
Except in the case of certain types of intelligence or reading test items, 
the difficulty of an item should arise from the problem involved rather 
than from the words in which it is expressed. Test items should not 
be verbal puzzles. They should indicate whether the student can pro- 
duce the answer, not whether he can understand the question. 

Lack of clarity in a test item may arise from inappropriate choice and 
awkward arrangement of words. It may also arise from lack of clarity in 
the thinking of the person who wrote it. Many ideas for test items are 
vague and general at first. Before emerging in final form, they need criti- 
cal examination and revision. In this process, clarification of ideas goes 
hand in hand with improvements in wording. 

It is difficult to provide a list of specific rules which, if followed, will 
guarantee clarity. The things that must be done and the things that must 
be avoided are numerous and varied. Further, their application in specific 
situations is a matter calling for expert judgment. It is worth noting, how- 
ever, that many of the suggestions made in the remainder of this section 
may be considered as elaborations of this first and most important suggestion. 

2. Choose words that have precise meaning wherever possible. 

Lack of clarity in an item frequently arises from inappropriate word 
choices. Many commonly used words and phrases have no precise mean- 
ing. Others have no meaning that applies accurately in the context in which 
they appear. 

In the following item, the words "yielded," "unofficially," and "points," 
are vague. 

1. In the 1948 presidential campaign, Truman yielded unofficially to Dewey 
on which of these points? 



a. That the un-American activities investigations were not direct attacks 
on the Democratic party. 

b. That the tone of the political campaign should be kept on a high level. 

c. That the civil rights program should be abandoned. 

d. That there should be an equal balance of Democrats and Republicans 
in Congress to insure sound legislation. 

3. Avoid complex or awkward word arrangements. 

The following item, dealing with the interpretation of a map of the 
Northern Hemisphere, was very awkwardly worded on the first attempt: 

1. What is the relative length of the shortest path between cities A and C, 
and the North Pole? 

a. They are approximately equal. 

b. That from A is slightly longer than that from C. 

c. That from A is slightly shorter than that from C. 

d. That from A is twice as long as that from C. 

When revised, with the addition of two cities 
among the distracters, the item reads as follows. 

2. Which city is closest to the North Pole? 

a. City A 

b. City B 

c. City C 

d. City D 

While this may not be the place for a dis- 
course on rhetoric, a few suggestions may be 
made to illustrate the general point involved. 
The structure of sentences used should be as 
simple as possible. It is often advantageous to break up a complex sentence 
into two or more separate sentences. A qualifying phrase should be placed 
near the term it qualifies. In general, it is desirable to make clear the point 
of the question early in the sentence and add qualifications or explanations 
later. Finally, it is often helpful for the item writer to ask himself, "Just 
what is the point of this item?" In the answer to this question an item 
writer may find a simpler, more direct wording for his item. Or he may 
discover that it has little point, or one that is not worth testing. 

4. Include all qualifications needed to provide a reasonable basis for 
response selection. 

Frequently an item writer does not state explicitly the qualifications that 
exist implicitly in his own thinking about a topic. He forgets that a different 
individual, at another time, needs to have these qualifications specifically 
stated. The following item illustrates this point. 

1. If a ship is wrecked in very deep water, how far will it sink? 

a. Just under the surface. 

b. To the bottom. 

c. Until the pressure is equal to its weight. 

d. To a depth which depends in part upon the amount of air it contains. 

A number of capable students selected response d, instead of the intended 
correct response b, because they considered the possibility (which the 
writer failed to exclude) that a wrecked ship might not sink completely 
but remain partly submerged. In that case, response d, while not good, is 
the best response available. 

2. The greatest loss in hail storms results from damage to 

a. livestock 

b. skylights 

c. growing crops 

Since the item does not specify loss to whom or in what part of the coun- 
try, there is no reasonable basis for response unless the examinee assumes 
that the item refers to the United States as a whole. 

3. Calcium chloride attracts moisture and goes into solution. 

While this statement is true in the typical atmosphere, it is not true in the 
atmosphere of a desiccator. Capable examinees will recognize this possi- 
bility, and perhaps miss the item because of their superior ability. 

5. Avoid the inclusion of nonfunctional words in the item. 

A word or phrase is nonfunctional when it does not contribute to the 
basis for choice of a response. Unnecessary words and phrases frequently 
fit easily in the wording of an item but prove on examination to be com- 
pletely irrelevant to the decisions that must be made in selecting the response. 

Sometimes an item writer may include an introductory statement in an 
effort to strengthen the apparent appropriateness or significance of the 
item. The following illustration, taken from a test of contemporary affairs, 
includes such a statement. 

1. While many in the U.S. fear the inflationary effects of a general tax 
reduction, there was widespread support for a Federal community-prop- 
erty tax law under which 

a. husbands and wives could split their combined income and file sep- 
arate returns. 


b. homesteads would be exempt from local real estate taxes. 

c. state income taxes might be deducted from Federal returns. 

d. farmland taxes would be lower. 

In order to answer this item, it is necessary to know only in what way the 
community-property laws affect tax computation. This question can be 
brought into sharper focus by rewording the item as follows, eliminating 
reference to "fear of inflation," "general tax reduction," or "widespread support." 

2. Community-property tax laws permit 

a. husbands and wives to split their combined income and file separate returns. 

b. homesteads to be exempt from local real estate taxes. 

c. state income taxes to be deducted on Federal returns. 

d. farmland taxes to be lowered. 

Some introductory statements that are not strictly necessary may oc- 
casionally be justified as "window dressing" that helps to clarify the point 
of an item or to establish its importance. This use is illustrated in the 
following item. 

3. The pollution of streams in the more populous regions of the United 
States is causing considerable concern. What is the effect, if any, of 
sewage on the fish life of a stream? 

a. It destroys fish by robbing them of oxygen. 

b. It poisons fish by the germs it carries. 

c. It fosters development of non-edible game fish that destroy edible fish. 

d. Sewage itself has no harmful effect on fish life. 

If, however, the irrelevant material is put in to make the answer less 
obvious or to mislead the examinee into choosing an otherwise weak dis- 
tracter, the result is totally bad. Not only does it tend to destroy the validity 
of the item as a measure of what it is intended to measure, but it weights 
undesirably the factor of reading comprehension as a determiner of correct 
response. In general, items should be kept as short as is consistent with 
clear statement. Examinees vary widely in speed of reading, and the inevit- 
able advantage given to the rapid readers in most objective tests should 
not be unnecessarily increased. 

6. Avoid unessential specificity in the stem or the responses. 

General knowledge is knowledge that may be applied in a variety of 
specific situations. The superior value of general knowledge over specific 
knowledge has long been recognized and should be reflected in tests wher- 


ever possible. The following item is undesirably specific both in the prob- 
lem presented and in the alternative responses. 

1. What percent of the milk supply in municipalities of over 1,000 was 
safeguarded by tuberculin testing, abortion testing, and pasteurization? 

a. 11.1% 

b. 20.3% 

c. 31.5% 

d. 51.9% 

e. 83.5% 

A correct response to this item simply indicates that the student remem- 
bers what he has read or heard in class. It does not indicate that he has a 
general conception of the present status of safeguards to the purity of 
milk. Items stated in this way are powerful incentives to rote-learning 
and may unfairly penalize a student whose study habits are basically sound 
and effective. 

A better approach is employed in the following item: 

2. What techniques have been widely adopted in an effort to safeguard the 
purity of city milk supplies? 

a. Only pasteurization of milk. 

b. Only the elimination of diseased cows. 

c. Neither of the preceding has been widely adopted. 

d. Both have been widely adopted. 

It should not be assumed on the basis of the foregoing discussion that 
the more general an item is, the better it is. General statements are or- 
dinarily less precise than specific statements. They require more qualifica- 
tions and admit more exceptions. It is less easy to state an acceptable an- 
swer to a general question. Thus, there are disadvantages as well as 
advantages in the use of general questions. The item writer must strike 
a careful balance between the two in order to produce the best possible 
test item. 

7. Avoid irrelevant inaccuracies in any part of the item. 

Irrelevant inaccuracies are usually unintentional. Even though they may 
have nothing to do with the selection of the correct response or the elimina- 
tion of incorrect responses, their inclusion is undesirable for two reasons. 
In the first place, they reflect unfavorably on the information and ability 
of the item writer. In the second, they may become fixed in the mind of 
the student as true statements, suggesting or re-enforcing erroneous ideas. 
Consider the following example. 


1. Why did Germany want war in 1914? 

a. She was following an imperialistic policy. 

b. She had a long-standing grudge against Serbia. 

c. She wanted to try out new weapons. 

d. France and Russia hemmed her in. 

In many respects this is a significant and well-constructed item. It does, 
however, suggest something that has not been established historically. It 
is unlikely that Germany was actively seeking war in 1914. A more reason- 
able interpretation is that she was seeking certain goals and would accept 
war rather than give up those goals. A somewhat similar fault is illustrated 
by the following item. 

2. Studies of general intelligence indicate young people should be first 
permitted to vote at what age? 

a. 16 c. 20 

b. 18 d. 22 

This item implies that general intelligence is the only factor to be con- 
sidered in determining minimum voting age. Few political scientists or 
sociologists would accept this view. The real basis for response in this 
item is the examinee's information that mental ability, as defined in in- 
telligence testing, does not increase much in the typical individual after 
the age of eighteen. But it is undesirable to imply that this factor alone 
should determine the minimum age for voting. 

8. Adapt the level of difficulty of the item to the group and purpose 
for which it is intended. 

Recommendations concerning item difficulty and the statistical basis on 
which they rest will be discussed in another chapter. It is sufficient here 
to point out that the usefulness of a test item depends in no small measure 
upon the appropriateness of its difficulty for the group of examinees who 
will take it. 

While subjective judgments of item difficulty have not been found to be 
highly accurate, an individual who is well acquainted with the general 
level of ability of the examinees and their typical performance on similar 
items can do a useful job of judging item difficulties. 

In connection with the adjustment of item difficulty, two pitfalls need 
to be avoided. The first is the application of a "minimum essentials" con- 
cept to an achievement test. According to this concept, a test should include 
only those questions that all students should, according to the objectives 
of instruction, be able to answer correctly. A test composed of items meet- 
ing this requirement will almost certainly be too easy to discriminate 


clearly between different levels of ability. The second mistake is the selec- 
tion of items from the standpoint of what the ideal student should know 
rather than in terms of what the typical student does know. An item 
writer who is unrealistic about the extent to which typical students ap- 
preciate the fine points of a subject is likely to produce unreasonably 
difficult items. 

The difficulty of an objective test item is not determined solely by the 
idea on which it is based. A writer can often construct several items on 
the same general idea that differ widely in difficulty. Actually, of course, 
the items may be testing different abilities, but it is difficult if not impos- 
sible to define the difference in abilities measured in terms other than the 
specific differences between the items themselves. 

Thus, it is a mistake to interpret item analysis data as indicating, for 
example, that a certain percent of students in a specified group understand 
osmosis, or the formation of hailstones. All one is justified in saying is 
that a certain percent are able to answer correctly a particular item on 
osmosis, or a particular item on the formation of hailstones. Low scores 
on a test indicate either low achievement or difficult items or both. High 
scores may reflect easy items rather than high achievement. 
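The percentage reported in item analysis is simply the proportion of examinees answering that particular item correctly. A minimal sketch, with made-up response data:

```python
def item_difficulty(responses, correct):
    """Proportion of examinees answering the item correctly -- the figure
    item-analysis data report for a single item."""
    return sum(r == correct for r in responses) / len(responses)

# hypothetical responses of eight examinees to one item keyed "c"
responses = ["c", "c", "a", "c", "d", "c", "c", "b"]
print(item_difficulty(responses, "c"))  # 0.625
```

As the text cautions, this figure describes one item only; it does not license a claim about how many examinees "understand" the underlying topic.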

An item writer can control partly the difficulty of multiple-choice and 
matching items by adjusting the homogeneity of responses or by making 
use of compound responses. The more homogeneous the responses to a 
multiple-choice item, the greater the difficulty of the item. Where responses 
are sufficiently heterogeneous, selection of the best is reasonably easy. Two 
versions of the same vocabulary item will illustrate this point. 

1. His gaunt companion 

a. beautiful 

b. healthy 

c. haggard 

d. youthful 

2. His gaunt companion 

a. ugly 

b. ill 

c. haggard 

d. aged 

Obviously, the responses in the second item are much more nearly alike, 
so that a more specific and hence, presumably, less widely possessed knowl- 
edge is required to select the best. Item writers who have a tendency to 
look for fine distinctions and thus tend to produce relatively difficult 
items may frequently "ease up" the items by substituting distracters that 


are less like the answer. On the other hand, the item writer should be 
aware that such changes may also change the thing which the item is 
measuring. The hypothetical universe of potential items from which the 
test constructor draws his sample is usually large and representative enough 
so that it is not necessary to include many items which are inappropriate 
in difficulty for the group being tested. But if the alteration or elimination 
of certain items systematically reduces the emphasis on important objec- 
tives, the changes are definitely undesirable. 

Compound responses may be used to simplify an item which would 
otherwise be too difficult. This possibility is illustrated in the following items. 

3. The Nobel Peace Prize was awarded in 1947 to a 

a. group of scientists who refused to work on the atomic bomb. 

b. committee of church members organized to help war victims everywhere. 

c. group of newspapermen who strongly supported the United Nations. 

d. group of lawyers who prosecuted Nazi criminals. 

4. When the precipitate of ferrous hydroxide is exposed on a filter paper, 
what happens to it? Why? 

a. It turns red due to oxidation. 

b. It turns green due to reduction. 

c. It turns black due to decomposition. 

d. It remains unchanged due to its stability. 

A response may be selected to item 3 by one who knows either which 
group received the prize or the basis on which it was awarded. Likewise, 
in item 4, a response may be selected on the basis of knowledge of the 
outcome of the experiment or on the basis of understanding of the funda- 
mental principles involved. 

Compound responses may also be used to increase the difficulty of an 
item by requiring the examinee to demonstrate two abilities in choosing 
a response. One common illustration of this possibility is found in items 
that require both an answer and an explanation. 

5. If a tree is growing in a climate where rainfall is heavy, are large leaves 
an advantage or a disadvantage? 

a. An advantage, because the area for photosynthesis and transpiration is increased. 

b. An advantage, because large leaves protect the tree during heavy rainfall. 

c. A disadvantage, because large leaves give too much shade. 

d. A disadvantage, because large leaves absorb too much moisture from 
the air. 


9. Avoid irrelevant clues to the correct response. 

The effect of irrelevant clues may be to make the item easier as a whole 
or, which is more serious, may change the basis upon which the item 
discriminates. If all students notice the clue and all respond correctly on 
the basis of it, the item becomes nondiscriminating and, hence, useless. 
If only the more capable examinees utilize the clue and all others overlook 
it (that is, if ability to use the clue is highly related to the ability the test 
is intended to measure), the item may not be seriously weakened. More 
commonly, however, a number of examinees who would not normally be 
able to choose the correct response notice the clue and respond correctly 
on the basis of it. In this case the presence of the clue definitely weakens 
the item. 

Irrelevant clues may be of several varieties. Clues may sometimes be pro- 
vided by pat verbal associations. 

1. What does an enclosed fluid exert on the walls of its container? 

a. Energy c. Pressure 

b. Friction d. Work 

It is necessary only to know that "exert" is commonly used with "pressure" 
to answer the question correctly. Another type of irrelevant clue is pro- 
vided by grammatical construction. 

2. Among the causes of the Civil War were 

a. Southern jealousy of Northern prosperity. 

b. Southern anger at interference with the foreign slave trade. 

c. Northern opposition to bringing in California as a slave state. 

d. differing views on the tariff and constitution. 

Quite obviously the item stem calls for a plural response, which occurs 
only in d. 

An item writer may provide irrelevant clues to the correct responses by 
consistently stating them more precisely and at greater length than the incorrect responses. 

3. Why were the Republicans ready to go to war with England in 1812? 

a. They wished to honor our alliance with France. 

b. They wanted additional territory for agricultural expansion and felt 
that such a war might afford a good opportunity to annex Canada. 

c. They were opposed to Washington's policy of neutrality. 

d. They represented commercial interests which favored war. 

It should be noted that a single item cannot adequately illustrate this 
defect. If other items include incorrect responses phrased as elaborately 
and precisely as this correct response, the effect of the clue will be offset. 


However, there is a natural tendency for an item writer to be more careful 
in phrasing the correct response than in phrasing the distracters. Precision 
of phrasing has frequently been used by alert but poorly informed exami- 
nees as the basis for choosing the correct response. 

Irrelevant clues may also be provided by any systematic formal differ- 
ences between the answer and the distracters. Some item writers tend to 
place the answer in one favored position among the several alternatives. 
If, for example, an examinee observes that the third response given is 
most frequently the answer, or that the first response is seldom correct, he 
may use such observations as a basis for successful guesses on other items. 

One of the most frequently encountered types of irrelevant clue is pro- 
vided by common elements in the item stem and in the answer. An obvious 
example of this is provided in the following item. 

4. What led to the formation of the States Rights Party ? 

a. The level of Federal taxation. 

b. The demand of states for the right to make their own laws. 

c. The industrialization of the South. 

d. The corruption of many city governments. 

Common elements in stem and a response are frequently far less obvious 
than this, but they may spoil the effectiveness of an otherwise well-con- 
structed multiple-choice item. 

Occasionally an item writer will provide clues by inadvertently including 
interrelated items, so that the statement of one question, or its responses, 
provides a direct clue to the answer of another. For example: 

5. The term "biological warfare" refers to 

a. the struggle of living things to survive. 

b. the use of disease-producing organisms to defeat or weaken an enemy. 

c. the conflict between evolutionists and anti-evolutionists. 

d. the use of drugs to help save lives in combat. 

6. How far has the use of disease-producing organisms in warfare been 
developed ? 

a. The idea has been suggested but not developed. 

b. "Biological warfare" has been developed somewhat but it is not yet 
ready for use. 

c. Techniques of "biological warfare" are developed and ready for use. 

d. "Biological warfare" was used extensively in World War II, especially 
by saboteurs. 

The stem and responses of item 6 provide a direct suggestion of the an- 
swer to item 5. 

Interlocking questions are undesirable because they cannot be depended 
upon to measure what they were intended to measure. The best safeguard 


against errors of this type is careful rereading of a test as a whole, with 
particular attention to the relationships among the items. 

Words like "all," "none," "certainly," "never," and "always" have 
been designated "specific determiners." They tend to operate as irrelevant 
clues to the correct response, especially when used in true-false items. 
Statements including them are predominantly false. These words may be 
noted by clever examinees and used as a basis for choosing the correct 
response even when the fundamental knowledge or ability involved is 
lacking. Thus, specific determiners change the basis upon which an item 
discriminates and often destroy its usefulness. The falsity of the following 
sample true-false items is obvious because of the specific determiners: 

1. All diseases require medicine for their cure. 

2. If water is brought to a boil, all bacteria in it will certainly be killed. 

3. If drinking water is clear, it is always safe for drinking. 

Irrelevant clues provided by pat verbal associations or by common ele- 
ments in the item stem and one of the responses may be used constructively 
in multiple-choice items. Deliberate planting of clues of this type in the 
foils tends to defeat the rote-learner or to make the distracters highly 
attractive to those whose knowledge is superficial. This practice increases 
the power of the item to discriminate between real and simulated achievement. 

10. In order to defeat the rote-learner, avoid stereotyped phraseology 
in the stem or the correct response. 

A rote response is here defined as one in which words are used with 
no clear conception of their meanings. Rote responses are usually based 
on verbal stereotypes. The following questions (with intended answers in 
parentheses) illustrate the opportunities that certain items provide for 
response by rote. 

What is the biological theory of recapitulation ? 
(Ontogeny repeats phylogeny.) 

Who was the chief spokesman for the "American System"? 
(Henry Clay) 

What were the staple crops in the colonial South ? 
(tobacco, rice, indigo) 

In the first item the verbal stereotype is in the phrase "Ontogeny repeats 
phylogeny" and its association with the term "recapitulation"; in the sec- 
ond, "Henry Clay's American System"; in the third, "staple crops." To 


demonstrate the stereotyped character of these phrases, the reader is invited 
to attempt the clarification of the phrases "Ontogeny repeats phylogeny," 
the "American System," and "staple crop." It is true that some examinees 
may know the meanings of these phrases, but it is also obvious that the 
items do not require this knowledge for successful response. 

The emphasis that supply-type items (see page 193) place on some unique 
word or phrase as the answer makes them particularly subject to response 
by rote. When this form is used, it is almost impossible to avoid giving 
credit to the "lesson learner" who knows the words, whether or not he 
knows the meaning. Any correct answer supplied, however obviously 
stereotyped, must be given full credit. With choice-type items, on the other 
hand, it is possible to avoid such verbal stereotypes among the correct 
responses, or to work them into the distracters. The extent of rote-learning 
and the harm which it does were discussed by Lindquist at some length 
in The Construction and Use of Achievement Examinations (3), from 
which the following quotation is taken. 

Consider the following illustrations. The items given below were included 
in a battery of tests administered to a random sample of 325 physics students 
in Iowa high schools. 

1. What is the heat of fusion of ice in calories? 
(Answered correctly by 75 per cent of the pupils.) 

2. How much heat is needed to melt one gram of ice at 0° C.? 
(Answered correctly by 70 per cent of the pupils.) 

3. Write a definition of heat of fusion. 
(Answered correctly by 50 per cent of the pupils.) 

4. The water in a certain container would give off 800 calories of heat 
in cooling to 0° C. If 800 grams of ice are placed in the water, the 
heat from the water will melt 

(1) all the ice 

(2) about 10 grams of the ice 
(3) nearly all the ice 

(4) between 1 and 2 grams of the ice 
(Answered correctly by 35 per cent of the pupils.) 

5. In which of the following situations has the number of calories 
exactly equal to the heat of fusion of the substance in question been 
applied? 

(1) Ice at 0° C. is changed to water at 10° C. 

(2) Water at 100° C. is changed to steam at 100° C. 

(3) Steam at 100° C. is changed to water at 100° C. 

(4) Frozen alcohol at —130° C. is changed to liquid alcohol at 
-130° C. 

(Answered correctly by 34 per cent of the pupils.) 


It will be noted that these items progressively call for more and more 
thorough understanding of the heat of fusion of ice. Item 1 requires only 
a verbal association between "heat of fusion of ice" and "80 calories." This 
is the sort of association upon which physics pupils are frequently drilled in 
a more or less mechanical fashion until the association is firmly established. 
The success with which it has been established in this particular case is 
evidenced by the fact that 75 per cent of the pupils tested gave the correct 
response to this item. 

Item 2 is of essentially the same type as item 1, but employs a different 
phrasing from the pat form in which the question is usually stated. Even this 
slight variation in phrasing resulted in a 5 per cent decrease in the number 
of correct responses. 

The ability to supply the correct answer to either item 1 or 2 clearly can be 
of no functional value unless the pupil has some notion of the meaning of 
"heat of fusion." The data from item 3, however, indicate that there were 
many students who could make the verbal association called for in item 1 
or 2 who had no adequate understanding of the meaning of this term. (It 
may be noted that item 3 was scored in a very liberal fashion, and that many 
responses were accepted as correct that were technically imperfect.) 

Any student who really understood the definition provided in response 
to item 3 should have no difficulty in responding to items 4 and 5. It will 
be noted, however, that only 35 and 34 per cent, respectively, of the pupils 
responded correctly to these latter items. It is clearly apparent from these 
data that items such as 1, 2, and 3 above can provide only an inadequate 
basis on which to judge the pupils' understanding of the concept taught. 
Items of the type of 4 and 5 above are definitely superior. It is significant 
to observe in this connection that out of the 224 pupils who supplied the 
correct answer to item 1, only 16 per cent succeeded in all of the remaining 
items; in other words, only one out of every six students who had acquired 
the verbal association between "heat of fusion of ice" and "80 calories" 
had acquired even the low level of understanding of these terms called for 
in items 2, 3, 4, and 5. 
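The reasoning that item 4 in the quoted passage demands reduces to a single division; a minimal sketch in Python (the figure of 80 calories per gram is the one the passage itself associates with the heat of fusion of ice):

```python
# Heat of fusion of ice, in calories per gram -- the figure drilled as a
# verbal association in item 1 of the quoted passage.
HEAT_OF_FUSION_CAL_PER_G = 80

# Item 4: the cooling water can give off 800 calories in all, while
# 800 grams of ice are present.
available_calories = 800

melted_grams = available_calories / HEAT_OF_FUSION_CAL_PER_G
print(melted_grams)  # 10.0 -- "about 10 grams of the ice", response (2)
```

Seeing that only about 10 of the 800 grams can melt is precisely the step that separates understanding from the rote association tested by item 1.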

Several factors have probably contributed to the presence in tests of 
verbal stereotypes that permit successful response by rote. In the first place, 
test items tend to reflect the character of the instruction that preceded them. 
Education has been criticized repeatedly, and with considerable justification, 
for its preoccupation with verbal symbols and its neglect of the phenomena 
to which they refer. 

In the second place, it is much easier to build items that hold a pupil 
responsible for verbal associations than to build items that probe the pupil's 
understanding. Many item writers have borrowed statements from texts 
or references, simply rearranging the words, or introducing slight modifi- 
cations, to produce test items. Almost invariably such items can be an- 
swered successfully by anyone who has read the texts or references with 
some care and who possesses a good memory for verbal associations. 


The existence of this common fault is not recognized by many item 
writers, for it is unlikely that they would set out deliberately to test for 
rote-learning. But they apparently have failed to appreciate the extreme 
ease with which a stereotyped phrase, repeated in text or lectures or simply 
in common speech, is accepted and used by many who have only a vague 
idea of its original significance. 

Items of this type are harmful in two ways. In the first place, they do 
not provide a valid measure of the desired outcomes of instruction. Such 
items do not discriminate between those who understand and those who 
do not. The usefulness of purely verbal associations is strictly limited. 
Tests emphasizing such associations, if administered at the beginning and 
end of a course of instruction, may give the impression of striking progress 
on the part of the student when, in reality, permanent achievements are 
negligible. In the second place, such items exert an undesirable influence 
on teaching and learning. Concentrating the attention of teacher and pupils 
on word memory may crowd out efforts to develop understanding. 

The solution to this problem is to be found in better phrasing of items. 
Practical problem situations and fresh wording of essential ideas should 
be consciously sought. Efforts may be made to penalize the rote-learner 
by including attractive verbal stereotypes among the distracters, and by 
avoiding them as much as possible in the answers. 

11. Avoid irrelevant sources of difficulty. 

Just as it is possible inadvertently to incorporate clues to a correct re- 
sponse, it is also possible to place unintended obstacles in the way of the 
examinee. Quite frequently reasoning problems in mathematics are an- 
swered incorrectly by examinees who have reasoned correctly, but who 
have slipped in their computations. The following item was designed to 
measure understanding of the principle of price discounts. 

1. Mr. Walters was given a 12 1/2% discount when he bought a desk whose 
list price was $119.75. How much did he have to pay for the desk? 

A number of examinees who understood the principle missed the item 
because of errors in multiplication and in placement of decimal points. If 
the item is revised as follows: 

2. Mr. Walters was given a 10% discount when he bought a desk whose 
list price was $100. How much did he have to pay for the desk? 

the computational difficulty is almost entirely removed so that the prin- 
ciple alone can be tested. Test constructors may differ in their opinions 
concerning the advisability of eliminating computational difficulty from 


certain mathematics problems, but when complex or time-consuming cal- 
culations are included in the item, the item writer should recognize that 
he is chiefly testing computational skill rather than understanding of 
mathematical principles. 
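The difference in computational burden between the two versions of the discount item is easy to make concrete; a sketch of the arithmetic only, assuming a simple percentage discount:

```python
def price_after_discount(list_price, discount_rate):
    """Amount actually paid after deducting a percentage discount."""
    return list_price * (1 - discount_rate)

# Item 1: 12 1/2% off a $119.75 desk. The awkward figures invite slips
# in multiplication and in placing the decimal point.
print(price_after_discount(119.75, 0.125))  # 104.78125

# Item 2 (revised): 10% off a $100 desk. The computation is trivial,
# so the item tests the discount principle alone.
print(price_after_discount(100, 0.10))      # 90.0
```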

Short-Answer Form 

1. Use the short-answer form only for questions that can be answered 
by a unique word, phrase, or number. 

The need for this restriction was discussed in the previous section deal- 
ing with the characteristics of the short-answer form. The implication of 
this restriction is that a test composed exclusively of short-answer items 
is almost certain to overemphasize vocabulary. It is probably safe to observe 
that written tests in general, both essay and objective, have always placed 
too great a premium on vocabulary and too little upon other important 
aspects of achievement. 

2. Do not borrow statements verbatim from context and attempt to use 
them as short-answer items. 

Ambiguity of the item and perplexing variations in the answers are 
almost certain to result from this procedure. 

3. Make the question, or the directions, explicit. 
Avoid such indefinite questions as 

a. Who was George Washington? 

b. Where did Columbus land ? 

In computational problems, specify the degree of precision expected and 
indicate whether or not units of measurement must accompany numerical answers. 

4. Allow sufficient space for pupil answers, and arrange the spaces for 
convenience in scoring. 

It is frequently convenient to have all the blanks in a single column at 
either the left or the right margin of the examination paper. 

5. In computational problems specify the degree of precision expected, 
or, better still, arrange the problems to come out even except where ability 
to handle fractions and decimals is one of the points being tested. 

If the correctness of a numerical response depends upon stating the unit 


of measurement, make this fact clear. If not, it is best to include the unit 
of measurement in the statement of the question as, for example, 

c. The volume of a cube nine feet on an edge is ______ cubic feet. 

6. Avoid overmutilation of completion exercises. 

An extreme example of this may be observed in the following sample item. 

A hay ______ affords another ______ of the ______ existing between ______ and ______. 

A student with a good memory who had encountered this statement be- 
fore, might be able to puzzle it out and successfully fill the blanks with 
the words "infusion," "illustration," "relationship," "animals," and 
"plants." But it is obvious that far too many words have been removed to 
permit the item to pose a clear-cut problem. Even an expert biologist 
would find the item troublesome, and he would certainly brand it as trivial 
(unless he wrote it himself). 

The True-False Form 

1. Base true-false items only on statements which are true or false without 
qualifications. 

Item writers frequently have used broad generalizations and other 
declarative statements as true-false items. While most true-false items are 
declarative statements, not all such statements make acceptable true-false 
items, for many of them involve exceptions and hence should be marked 
false, if truth is interpreted strictly, although they are true in general. 

2. Avoid the use of long and involved statements with many qualify- 
ing phrases. 

The difficulty with such statements is that the examinee has trouble 
identifying the crucial element in the item. If it is necessary to use many 
words in describing a complex situation for a true-false item, separate 
sentences should be used. The issue to be judged true or false should be 
set apart at the end of the item. 

3. Avoid the use of sentences borrowed from texts or other sources as 
true-false items. 

Very few textbook statements, when isolated from context, are com- 
pletely and absolutely true. Moreover, many of them are of value chiefly 
as supporting or clarifying material and are not in themselves highly 


significant. Finally, in many cases the meaning of an isolated sentence is 
not clear. 

The difficulties involved in borrowing statements for use as true-false 
items may be illustrated in the following examples. 

1. World War II was fought in Europe and the Far East. 

2. A remarkable transaction occurred toward the end of the reign of Con- 
stantine the Great. 

3. Colloids are near-solutions. 

Item 1 is true so far as it goes, but is not completely true, for it fails to 
mention Africa, the Atlantic, and other battle areas. Item 2 is clearly 
introductory and has no inherent significance. The meaning of item 3 is 
not clear enough to permit a decision of "true" or "false." It is unfortunate 
that many similar statements, lacking absolute truth, basic significance, or 
clear meaning, have found their way into true-false tests. 

Multiple-Choice Form 

1. Use either a direct question or an incomplete statement as the item stem. 

There are some item ideas that can be expressed more simply and clearly 
in the form of incomplete statements than in the form of direct questions. 

1. The present Russian government is a 

a. democracy 

b. constitutional monarchy 

c. Communist dictatorship 

d. Fascist dictatorship 

If this item were written with a direct question as the stem it would re- 
quire more words and read less smoothly. 

2. The present Russian government is of which of the following types? 

a. A democracy 

b. A constitutional monarchy 

c. A Communist dictatorship 

d. A Fascist dictatorship 

On the other hand, some items require direct question stems for most 
effective expression. 

3. What part are physical scientists playing in public affairs at present? 

a. They concentrate on science and leave public affairs to others. 

b. They are working on a new form of government based on scientific principles. 

c. They are taking greater interest in political and social questions than 
ever before. 


d. They have assumed positions of leadership and control in governments 
all over the world. 

The incomplete statement is less clear and direct. 

4. The present part played by scientists in relation to public affairs is one of 

a. withdrawal from active participation in order to concentrate on science. 

b. work on a new form of government based on scientific principles. 

c. greater interest than ever before in political and social questions. 

d. leadership and control in governments all over the world. 

At present there is no experimental evidence on the relative efficiency 
of the two types of stem. Some experienced item writers exhibit a strong 
preference for the direct question. Others prefer the incomplete statement. 
Probably the effect of stem type upon the quality of an item is not large. 
There are, however, indications that beginners tend to produce fewer technically 
weak items when they try to use direct questions than when they 
use the incomplete statement approach. Several reasons for this tendency 
may be suggested. 

First, because of its specificity the direct question induces the item 
writer to produce more specific and homogeneous responses. When an 
incomplete statement is used as the item stem, the writer's point of view 
may shift as he writes successive responses. This tends to confuse the 
examinee concerning the real point of the item. 

Second, it is usually easier for the item writer to express complex ideas 
(those requiring qualifying statements) in complete question form. The 
necessity of having the completion come at the end of an incomplete state- 
ment restricts the item writer. He is not free to arrange phrases or words 
to produce the clearest possible statement. 

Third, and most important of all, the writer of a direct question usually 
states more explicitly the basis on which the correct response is to be 
chosen. Contrast the two item stems below. 

5. In comparing the exports and imports of the United States, we find that: 

6. In the United States, how does the value of exports compare with that of 
imports ? 

Item 6 obviously sets up a much more definite basis for choosing a correct 
response than item 5. The difference here is not inherent in the form, 
since item 5 could be improved without changing it to a direct question. 
However, there is a greater tendency for item writers to be vague when 
using incomplete statements than when using direct questions. Some incomplete 
item stems are altogether too incomplete, as in the following example. 


7. Merchants and middlemen 

a. make their living off producers and consumers, and are, therefore, nonproducers. 

b. are regulators and determiners of price and, therefore, are producers. 

c. are producers in that they aid in the distribution of goods and bring 
the producer and the consumer together. 

d. are producers in that they assist in the circulation of money. 

Restatement of this item using a direct question increases the number of 
words, but makes it much easier to understand. 

8. Should merchants and middlemen be classified as producers or nonproducers? 
Why? 

a. As nonproducers, because they make their living off producers and consumers. 

b. As producers, because they are regulators and determiners of price. 

c. As producers, because they aid in the distribution of goods and bring 
producer and consumer together. 

d. As producers, because they assist in the circulation of money. 

2. In general, include in the stem any words that must otherwise be 
repeated in each response. 

The following item is presented in two forms to illustrate this point. 

1. The members of the board of directors of a corporation are usually chosen 
by which of these? 

a. The bondholders of the corporation. 

b. The stockholders of the corporation. 

c. The president of the corporation. 

d. The employees of the corporation. 

2. Which persons associated with a corporation usually choose its directors? 

a. Bondholders 

b. Stockholders 

c. Officials 

d. Employees 

It is not always possible, or desirable, however, to eliminate all words 
common to the responses. In a preceding example, dealing with the activi- 
ties of merchants and middlemen, it was necessary to introduce each re- 
sponse with the word "as" to make grammatical sense. If the retention 
of common words in all of the responses makes the item easier to under- 
stand, they should be retained. In most cases, however, it will be found 
that the common words can be transferred to the stem without loss of clarity. 


3. If possible, avoid a negatively stated item stem. 

Experience indicates that this approach is likely to confuse the examinee. 
He is accustomed to selecting a correct response and finds it difficult to 
remember, in a particular isolated instance, to choose an incorrect response. 
The negative approach and the difficulty it frequently causes may be ap- 
parent in the following sample items. 

1. Which of these is not one of the purposes of Russia in consolidating the 
Communist party organization throughout Eastern Europe ? 

a. To balance the influence of the Western democracies. 

b. To bolster her economic position. 

c. To improve Russian-American relations. 

d. To improve her political bargaining position. 

2. Which of these is not true of a virus ? 

a. It is composed of very large living cells. 

b. It can reproduce itself. 

c. It can live only in plants and animal cells. 

d. It can cause disease. 

The use of a negative approach can sometimes be avoided by rewording 
the item, by reducing the number of responses, or both. Where use of 
negatively stated items appears to constitute the only satisfactory approach, 
they should be grouped together, under special directions to the examinee. 
Underlining, italicizing, or otherwise emphasizing the "not" is also essential. 

4. Provide a response that competent critics can agree on as the best. 

The correct response to a multiple-choice item must be determinate. 
While this requirement is obvious, it is not always easy to fulfill. Some- 
times through lack of information but more often through failure to con- 
sider all circumstances, writers produce items that confuse and divide even 
competent authorities. For example, experts disagreed sharply over the 
best response to each of the following questions. 

1. What is the chief difference in research work between colleges and industrial 
firms? 

a. Colleges do much research, industrial firms little. 

b. Colleges are more concerned with basic research, industrial firms 
with applications. 

c. Colleges lack the well-equipped laboratories which industrial firms possess. 

d. Colleges publish results, while industrial firms keep their findings secret. 


2. What is the chief obstacle to free exchange of scientific information be- 
tween scientists in different countries ? 

a. The information is printed in different languages. 

b. The scientists wish to keep the information secret for their own use. 

c. Scientists do not wish to use second-hand information from other scientists. 

d. Countries wish to keep some of the information secret for use in time 
of war. 

The most obvious remedy for this type of weakness is to have the items 
carefully reviewed by competent authorities. Items on which the experts 
cannot agree in selecting a best response should be revised or discarded. 

Expert reviewers may frequently suggest desirable improvements in the 
wording of the item, but the item writer should not feel bound to accept 
these suggestions if they do not affect choice of the answer. Some suggested 
changes may actually weaken the item. Expert reviewers have a tendency 
to "split hairs at the Ph.D. level," and to prefer the technical jargon and 
stereotypes with which they are most familiar. The changes in wording 
they suggest may sometimes make the item more verbose and confusing 
to the examinees for whom it is intended, or may destroy its ability to 
discriminate those who understand from those who simply possess verbal associations. 

5. Make all the responses appropriate to the item stem. 

Writers sometimes produce items in which none of the responses is 
reasonably correct. 

1. Loss due to hail is greatest in which of the following cases? 

a. To livestock 

b. To skylights 

c. To growing crops 

2. Why do living organisms need oxygen? 

a. Purification of the blood 

b. Oxidation of wastes 

c. Release of energy 

d. Assimilation of foods 

3. What process is exactly the opposite of photosynthesis? 

a. Digestion 

b. Respiration 

c. Assimilation 

d. Catabolism 

The responses to the first item are not "cases." The responses to the sec- 
ond item are not stated as reasons, as required by the stem. The third item 


illustrates a different type of difficulty. It asks a question which has no 
possible correct answer, since no process is exactly the opposite of photo- 
synthesis. In all three cases the items can be improved by rewording. 

4. The greatest loss in hail storms for the country as a whole results from 
damage to 

a. livestock 

b. skylights 

c. growing crops 

5. Why do living organisms need oxygen? 

a. To purify the blood 

b. To oxidize waste 

c. To release energy 

d. To assimilate food 

6. What process is most nearly the opposite of photosynthesis chemically? 

a. Digestion 

b. Respiration 

c. Assimilation 

d. Catabolism 

One fairly obvious indication of inappropriate or carelessly written re- 
sponses is lack of parallelism in grammatical structure. This is illustrated 
in the following item. 

7. What would do most to advance the application of atomic discoveries to medicine? 

a. Standardized techniques for treatment of patients. 

b. Train the average doctor to apply radioactive treatments. 

c. Reducing radioactive therapy to a routine procedure. 

d. Establish hospitals staffed by highly trained radioactive therapy specialists. 

The responses to a multiple-choice item should always be expressed in 
parallel form. Sometimes this can be achieved by a simple change in word- 
ing. In other cases it requires substitution of a more appropriate response. 
The revised and improved item is given below: 

8. What would do most to advance the application of atomic discoveries to 
medicine? 

a. Development of standardized techniques for treatment of patients. 

b. Training of the average doctor in application of radioactive treatments. 

c. Removal of restriction on the use of radioactive substances. 

d. Addition of trained radioactive therapy specialists to hospital staffs. 

6. Make all distracters plausible and attractive to examinees who lack 
the information or ability tested by the item. 


In addition to inappropriate distracters resulting from careless writing, 
there are others resulting from failure to consider plausibility. Consider 
the following item. 

7. Which element has been most influential in recent textile developments? 

a. Scientific research 

b. Psychological change 

c. Convention 

d. Advertising promotion 

Only the first response is plausible as an answer to question 7. Another 
example is provided by item 8. 

8. Why is physical education a vital part of general education ? 

a. It guarantees good health. 

b. It provides good disciplinary training. 

c. It balances mental, social, and physical activities. 

d. It provides needed strenuous physical exercise. 

The alert examinee would reason that nothing can guarantee good health, 
that disciplinary training is now in low repute educationally, and that 
strenuous physical exercise is seldom recommended. Such an item might 
function well as a test of understanding of verbal meaning, but it would 
not discriminate between those who do and those who do not understand 
the place of physical education in general education. 

Each distracter should be designed specifically to attract those examinees 
who have certain common misconceptions or who tend to make certain 
common errors. The mathematics test item which follows illustrates this point. 

9. The ratio of 25 cents to 5 dollars is 

a. 1/20 

b. 1/5 

c. 5/1 

d. 20/1 

e. none of these 

The examinee who carelessly overlooks the distinction between cents and 
dollars, or inverts the ratio, will arrive at one of the distracters rather than 
at the answer. Some item writers have found it helpful to first present 
multiple-choice item stems as free-response items, and then to use incorrect 
responses of some examinees as attractive distracters. 
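The advice that each distracter target a specific error can be made concrete for item 9; a sketch in Python that derives each option from the mistake it is meant to catch (exact fractions avoid any rounding):

```python
from fractions import Fraction

CENTS_PER_DOLLAR = 100

# Correct work: express both quantities in the same unit before forming
# the ratio -- 25 cents to 500 cents.
answer = Fraction(25, 5 * CENTS_PER_DOLLAR)
print(answer)  # 1/20 -- response a

# Error 1: ignore the cents/dollars distinction and form 25 : 5.
units_error = Fraction(25, 5)    # equals 5, i.e. 5/1 -- response c

# Error 2: convert correctly but invert the ratio.
inversion_error = 1 / answer     # 20/1 -- response d
print(units_error, inversion_error)
```

Each careless examinee thus lands on a listed distracter rather than on "none of these."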

7. Avoid highly technical distracters. 

Item writers, needing additional distracters, are sometimes tempted to 


insert a response the meaning or applicability of which is completely 
beyond the ability of the examinee to understand. 

1. Electric shock is most commonly administered in the treatment of 

a. rheumatism 

b. paralysis 

c. insanity 

d. erythema 

The first three suggested responses are fairly common terms. The fourth 
is almost never encountered. It is definitely a "space filler" in this item, 
but it presents a frustrating problem to the examinee, since he is forced 
to choose a best answer without knowing the meaning of one of the an- 
swers. The level of information or ability required to reject a wrong re- 
sponse should be no higher than the level of ability required to select a 
correct response. When this is true, an examinee may sometimes arrive 
at his choice by successively eliminating incorrect answers. Response by 
elimination has been criticized, but it has one possible advantage over 
response by direct selection. The examinee may need more pertinent in- 
formation to eliminate three plausible distracters than he would need to 
select one correct verbal stereotype. 

8. Avoid responses that overlap or include each other.

An example of this defect is provided by the following item. 

1. What percent of the total loss due to hail is the loss to growing crops?

a. Less than 20% 

b. Less than 30% 

c. More than 50% 

d. More than 95% 

This item is, in effect, a two-response item. The choice lies between re- 
sponses b and c. For if a is correct then b is also correct, and if d is correct
then c is also correct. More subtle examples of this defect are occasionally 
encountered in item writing. 

9. Use "none of these" as a response only in items to ivhich an abso- 
lutely correct answer can be given; use it as an obvious answer several times 
early in the test but use it sparingly thereafter; and avoid using it as the 
answer to items in which it may cover a large number of incorrect responses. 

"None of these" is quite appropriate as a response in the correct-answer
variety of multiple-choice item. It is inappropriate in best-answer items.
An examinee may properly reject all suggested responses if he is working 
under instructions to choose only completely correct answers. He cannot 


reasonably be asked to mark "none of these" when his general instruction
is to pick the best of several admittedly imperfect responses. 

"None of these" is a useful response for items in arithmetic, spelling, 
punctuation, and similar fields where conventions of correctness can be 
applied rigorously. It provides an easy-to-write fourth or fifth response 
when one is needed and may be more plausible than any other that can 
be found. It sometimes enables the item writer to avoid stating an answer 
that is too obviously correct. 

Two dangers are connected with the use of "none of these." The first 
is that it may not be seriously considered as a possible answer. The second 
is that the examinee who chooses "none of these" as the correct response
may be given credit for a wrong answer. To avoid the first danger, the 
examinee must be convinced at the beginning of the test that "none of 
these" is likely to be the answer to some items. This can be achieved by 
using it as the correct response to several easy items early in the test. 

The second danger can be avoided by sparing use of "none of these" 
as the correct answer after the beginning of the test, and by limiting its 
use as the answer to items in which the possible incorrect responses are 
relatively few. "None of these" would be an appropriate answer to the 
following item. 

1. What is the area of a right triangle whose sides adjacent to the right 
angle are 3 inches and 4 inches long respectively? 

a. 7 

b. 12 

c. None of these (answer) 

Some examinees may miss this item by simply adding 3 and 4. Others 
might multiply 3 by 4 and forget division by 2. Still others might add 
3, 4, and 5. Since the number of possible incorrect responses to this item 
is limited, they may all be included as distracters, so that only examinees 
who solve the problem correctly will be likely to choose "none of these." 
This situation does not prevail in the following item. 

2. What is the sum of [the column of addends and the remaining responses
are not reproduced in this copy]

186,226 (answer)

None of these

It is obviously impossible to anticipate all of the possible errors students 
might make in responding to this item. Hence, it would be undesirable to 
use "none of these" as the answer with only a few of the possible incorrect 
responses listed as distracters. It is far more appropriate to use it as a distracter in such an item.


10. Arrange the responses in logical order, if one exists, but avoid con- 
sistent preference for any particular response position. 

Where the responses consist of numbers, they should ordinarily be 
put in ascending or descending order. If the responses are small numbers 
such as 1, 2, 3, 4, or 5, the 1 should occur in the first position, 2 in the 
second position, and so on. If this is not done, there will be a strong 
tendency for the examinees to confuse the absolute value of the answer 
with the response position used to indicate it. Even when the positions are 
lettered, the examinee may think of them numerically, and indicate a 
numerical response of 3 in the c position even though some other letter 
is used to represent 3 in the item. 

If an item contains one or more pairs of responses dealing with the 
same concept, these should usually be placed together. In the following 
item, it is preferable to arrange the responses as shown, rather than to 
distribute them at random among the choice positions. 

8. Which of these would you expect to be anti-inflationary in the United 
States?

a. Increased consumption of goods. 

b. Increased exports to Europe. 

c. Limitation of credit to consumers. 

d. Limitation of the size of savings accounts. 

In many items, however, there is no objection to assigning the responses
at random to the response positions. This gives the item writer an op-
portunity to balance roughly the number of answers occurring in each
response position.

Some writers have advocated that obvious answers should be placed
last so that the examinee will be forced to read and consider the distracters
before seeing the correct response. There is no evidence concerning the 
effectiveness of this procedure, and it appears to be of doubtful value. If 
the answer is so obvious that the examinee will choose it the moment he 
sees it, placing it last is not likely to help the item much. 

11. If the item deals with the definition of a term, it is often preferable 
to include the term in the stem and present alternative definitions in the responses.

The reason for this suggestion is that it usually provides more opportuni- 
ties for attractive distracters and tends to reduce the opportunity for correct 
response by verbal association. Consider these illustrations. 

1. What name is given to the group of complex organic compounds that 
occur in small quantities in natural foods and are essential to normal 
nutrition ? 


a. Nutrients 

b. Calories 

c. Vitamins 

d. Minerals 

2. What is a vitamin? 

a. A complex substance necessary for normal animal development, which 
is found in small quantities in certain foods. 

b. A complex substance prepared in biological laboratories to improve 
the nutrient qualities of ordinary foods. 

c. A substance extracted from ordinary foods, which is useful in destroy- 
ing disease germs in the body. 

d. A highly concentrated form of food energy, which should be used 
only on a doctor's prescription. 

In the second item it is clear that more of the common misconceptions 
about the meaning of the term "vitamin" can be suggested and made at- 
tractive to the superficial learner. 

12. Do not present a collection of true-false statements as a multiple- 
choice item. 

Such items usually reveal the item writer's failure to identify or specify 
a single problem. In some cases, the true-false statements are grouped 
about a single problem and could be easily reworded to make that problem 
specific. In other cases, the statements are so loosely related that they 
hardly constitute a single problem at all. This situation is illustrated by 
the following item. 

1. What does physiology teach?

a. The development of a vital organ is dependent upon muscular activity. 

b. Strength is independent of muscle size. 

c. The mind and body are not influenced by each other. 

d. Work is not exercise. 

Here two of the responses show some similarity. The other two are quite 
diverse. Grouping all in a single item leads the examinee to look for a 
common principle. It is difficult for him to arrive at any rational basis for 
selecting a best response. One beneficial change would be to replace re- 
sponses c and d by others dealing with muscles or muscle activity, and to 
reword the stem to point toward this problem. 

Matching Exercises 

1. Group only homogeneous premises and homogeneous responses in 
a single matching item. 

The premises and responses in the following item are not homogeneous. 


1. A drawing tool used primarily as a guide to draw horizontal lines

2. Avoidance of erasures, blots, uneven lines, or poorly shaped letters

3. Any part of the circumference of a circle

a. dividers
b. arc
c. French curve
d. neatness
e. T-square

Such an item measures only very superficial verbal associations. It is easily 
solved by those who have only vague concepts. 

2. Use relatively short lists of responses. 

Seldom should more than five alternative responses be suggested for a 
given group of premises. Two reasons for this recommendation concern 
the item writer. It is difficult to maintain homogeneity in a long list of
responses. Further, long lists of responses reflect concentration on one 
aspect of achievement which prevents proper distribution of emphasis on 
all aspects. The other reason concerns the examinee. With few responses, 
little time need be wasted in hunting for a proper response. The examinee 
may even fix the responses in mind so that his only problem is that of 
reading each premise and deciding which response best applies to it. 

The only necessary limitation to the number of premises is imposed by 
the requirement that they must all belong with the same homogeneous 
group of responses. It is often impossible to find a large number of 
premises to which the same group of responses constitute plausible match- 
ings. Even where it is possible, the item writer should probably use short 
lists of homogeneous premises of the same kind, so that he can sample 
more different topical areas. 

3. Arrange premises and responses for maximum clarity and convenience 
to the examinee. 

In general it is desirable to use the longer, more complex statements as 
premises, to arrange them at the left, and to number them as independent 
items. The responses should be arranged in order, if any logical basis for 
order exists, to simplify the task of matching. 

4. The directions should clearly explain the intended basis for matching. 

In simple matching exercises, the basis may be almost self-evident, but 
it should be made explicit in the directions. For classification-type items, 
specific instructions are needed. Illustrative items presented in the section 
on item forms show this detail in directions.¹

¹ See p. 201.


5. Do not attempt to provide perfect one-to-one matching between
premises and responses. 

The same response may be used for more than one premise. Occasionally 
responses that fit none of the premises should be included. Nothing is 
gained by attempting to provide equal numbers of premises and responses 
and to assure perfect matching. On the contrary, something is lost because 
the examinee may be given an irrelevant clue to one correct response. 

The Interpretive Test Exercise 

The interpretive test exercise represents a relatively new and highly 
promising development. Because it constitutes a larger unit than the typical 
test item, and because it presents special possibilities and problems, it is 
discussed here as a separate unit. 

The interpretive test exercise consists of an introductory selection of 
material followed by a series of questions calling for various interpreta- 
tions. The material to be interpreted may be a selection of almost any type 
of writing (news, fiction, science, poetry, etc.), a table, map, chart, dia- 
gram, or illustration; the description of an experiment or of a legal prob- 
lem; even a baseball box score or a portion of a musical composition. The 
questions on this material may be based on explicit statements in the ma- 
terial, on inferences, explanations, generalizations, conclusions, criticisms, 
and on many other interpretations. Since the interpretive exercise may 
employ any of several item forms, since it includes introductory materials 
as well, and since it has special possibilities for measurement, it deserves 
separate discussion. 

The following two illustrations suggest the general form and content 
of the interpretive test exercise, although they by no means represent all 
of its possible varieties. 

It has been stated that "like Hellas, the Swiss Land was born divided," 
and also that "political solidarity had a hard, slow birth in the mountains." 
Certainly the physical features of the Swiss lands in serving sharply to con- 
fine movement and widely to separate settled areas did not facilitate inter- 
course and thus political cooperation. In mountainous Switzerland, at any 
rate, village communes tended to occupy the narrow lateral valleys of the 
Alps, where they engaged in agricultural and pastoral pursuits in a state 
of almost complete political and economic isolation and self-sufficiency. On 
the other hand, the geographical position of the Swiss lands was such as 
to induce a continual current of traffic en route for the passes of the Central 
Alps, whilst the major valleys of the principal rivers formed the main high- 
ways of communication. Moreover, the Swiss plateau stretching between 
Lake Constance and Geneva and cupped between the mountain ranges 


formed a broad belt of well-watered and relatively low-lying land which was 
capable of supporting a population much denser than that of the mountains. 
Actually, it was not the more-favored plateau lands, but certain cantons of 
the mountains which provided both the leadership in the wars for inde- 
pendence and the nuclear region around which the state grew. The reason 
seems to be that in the mountain valleys the peasant and shepherd popula- 
tion tenaciously defended their freedom from the encroachment of the 
feudal powers and largely escaped being reduced to serfdom, as were the 
inhabitants of the central plateau. 

1. What is meant by "the Swiss land was born divided"?

1) There were many different religious sects. 

2) Different languages were spoken in different parts of the country. 

3) The mountains isolated the people in different parts of the country. 

4) The people fought among themselves. 

2. With which of the following does the writer compare Switzerland? 

1) Ancient Rome

2) Ancient Greece 

3) Medieval Italy 

4) Medieval France 

3. Who took the lead in making Switzerland into a united nation?

1) The traders

2) The mountain people 

3) The farmers of the plains
4) The serfs 

4. What does the writer try to do in this paragraph ? 

1) To describe the factors in the early commercial development of
Switzerland.
2) To point out why Switzerland can never become a united country. 

3) To explain how trade changed the character of the Swiss nation. 

4) To show how geographical conditions affected the political unifica- 
tion of Switzerland. 

5. Which of the following is the most appropriate heading for this para- 
graph ? 

1) Early Swiss Commerce 

2) Trade Routes and Their Effect on Switzerland 

3) Geography and Swiss Freedom 

4) Agriculture on the Swiss Plateau 

Presidential Electoral Votes in United States 
by Political Parties 
Year Republican Democratic Progressive 

1904 336 140 

1908 321 162

[The table rows for the elections of 1912 through 1944 are not reproduced in this copy.]
1. Which party held the presidency during 1926? 

1) Republican 

2) Democratic

3) Progressive 

4) The table does not tell 

2. In what year was the Republican victory the most decisive ? 

1) 1904 

2) 1924 

3) 1928 

4) 1936 

3. Which of these statements about Democratic party strength is supported 
by the table? 

1) The Democrats won easy victories in both 1912 and 1916. 

2) The Democrats have been by far the strongest political party since 

3) Democratic party strength has been slowly increasing since 1932. 

4) Democratic party strength has been slowly decreasing since 1936. 

4. Between which two consecutive elections was there the greatest increase 
in the number of Democratic electoral votes ? 

1) 1908 and 1912 

2) 1912 and 1916 

3) 1928 and 1932 

4) 1932 and 1936 

5. The percentage of the electoral votes received by the Democrats was 
the largest in what year ? 

1) 1944 

2) 1936 

3) 1928 

4) 1912 


1. The interpretive exercise provides an opportunity for measuring 
directly one of the important outcomes of instruction — the ability to inter- 
pret and evaluate printed materials. 

Throughout this chapter and in preceding chapters the desirability, as 


well as the difficulty, of measuring the important outcomes of instruction 
directly has been stressed. Certainly the importance of ability to interpret 
printed materials is beyond question in modern times. What one needs 
to know and what one must do are most often presented as printed dis- 
cussions or directions. These materials must be interpreted and evaluated. 
Their usefulness depends upon the accuracy and depth of penetration of 
the interpretations made. 

The ability to interpret reading materials assumes even greater impor- 
tance when general scholastic aptitude is under consideration. This ability 
is probably more essential than any other to success in most branches of study.

It would be difficult to demonstrate empirically that interpretive exercises 
do actually measure ability to interpret. But such a demonstration is un- 
necessary. The tasks set in the questions of an interpretive exercise consti- 
tute in themselves an operational definition of interpretation. The usefulness 
of an interpretive exercise as measure of ability to interpret is limited only 
by the item writer's competence in selecting material to be interpreted and 
in formulating problems based on it. 

An interesting comparison can be made between essay tests, typical ob- 
jective tests, and interpretive exercises. The essay test, with which written 
testing began, asks, in effect, "What can you tell about this subject?" The 
prevalent types of objective test ask, in effect, "What do you know about 
this subject?" The interpretive exercise asks, in effect, "What are you able 
to find out from this material?" or "What are you able to do with it as 
background?" It is not the purpose of this paragraph to compare the 
merits of these three test devices. Rather it is to point out that the inter- 
pretive exercise occupies a somewhat unique and certainly an important 
place among the various test instruments. 

2. The interpretive exercise provides an effective setting in which to ask 
meaningful questions on relatively complex topics. 

In the independent item forms (multiple-choice, true-false, etc.) it is 
difficult to supply the necessary "raw material" with which an examinee 
may demonstrate his ability to organize, generalize, or evaluate. Often 
several paragraphs of material would be required. To include these in the 
item stem would make it cumbersome and inefficient. Once the material 
has been prepared and once the examinee has read it, the material provides 
the basis for no^ one, but many items. The group of related items which 
are part of the interpretive exercise make more complete use of these 
background materials than independent items could. 


3. The interpretive exercise reduces ambiguity by providing a common
ground of information for both the item writer and the examinee. 

One of the sources of ambiguity in objective items is the fact that the 
item writer and the examinee approach the same question from different 
points of view. In answer to the question, "Where was Paul converted to 
Christianity?" one examinee wrote, "Acts IV," a perfectly correct answer 
from his point of view. The introductory material in an interpretive exercise 
helps to set the stage in more minute detail than is commonly possible in 
independent items. Where both the item writer and the examinee are 
working in the same explicit setting, it is much easier for the examinee to 
comprehend what the item asks, and hence to respond to it to the limit 
of his information and ability. 

4. The interpretive exercise requires both a general ability to interpret 
and a specific background of terms, facts, and principles related to the 
material presented. 

General skill in reading is an important factor in the ability of an 
examinee to interpret selected materials. Knowledge of the special terms 
and concepts used and familiarity with the general principles and structure 
of the specific field with which the material deals are likewise important. 
Within limits, the relative influence of these two factors is under the con- 
trol of the item writer. He can include items that call heavily upon gen- 
eral interpretive ability and make almost no demands on specific knowl- 
edge. Or he can emphasize specific knowledge and minimize general 
interpretive ability. 

It should be noted, however, that the interpretive exercise cannot, and 
is not intended to, supply a "pure" measure of informational background. 
Likewise, when it deals with specialized subject matter, it cannot supply 
a pure measure of general interpretive ability. 

For some purposes, pure measures are very useful. On the other hand, 
there are many situations in which the combination provided by the inter- 
pretive exercise is not only unobjectionable but even preferable to independ-
ent measures. This is especially true if the item writer recognizes and uses 
his ability to control the relative influence of the two contributing factors. 

5. The interpretive exercise is well adapted to the evaluation of gen-
eral levels of educational development for individuals having diverse back-
grounds.
In evaluating the general educational development of an examinee, it 
is almost useless to ask what he remembers about details of factual con-
tent. What he knows on this level depends not only on how recently he
has studied but also on where and by whom he was taught. 

It is almost equally useless to ask him to state principles or solve prob- 
lems as taught in the course, because, while there is more common agree- 
ment with respect to principles and methods of problem-solving, these are 
also acquisitions which grow dim with disuse. Thus, an examinee who is 
poorly developed educationally but who has just come from a course of 
instruction may, on a typical subject examination, outscore a more highly 
developed examinee who has been away from the subject matter for some time.

The interpretive exercise has an important advantage for individuals of 
the latter type. They are not required to reproduce old learnings. Rather, 
they are asked to demonstrate that they have acquired a general background 
of experience, certain attitudes and other values, and familiarity with such 
useful ways of thinking as analysis, organization, etc. A skillful test con- 
structor using interpretive exercises can measure most of the important 
outcomes of general education. 

6. The interpretive exercise has wide applications, but its use presents 
some special problems. 

In the description of the interpretive exercise, mention was made of the 
wide variety of materials that could be presented for interpretation, and 
of the correspondingly wide variety of abilities that could be called for 
by the questions asked. This wide applicability has not been generally 
recognized, so that the interpretive exercise has not yet assumed the im- 
portant role to which it is entitled in the field of educational measurement.

On the other hand, some of the difficulties encountered in using inter- 
pretive exercises should be clearly recognized. Skilled, experienced item 
writers find it difficult to construct interpretive exercises of high quality. 
The selection or construction of suitable materials and the identification 
of item topics that make maximum use of the inherent possibilities of 
the material are added problems not encountered in ordinary item writing. 
The interpretive exercise is relatively time-consuming to administer. An 
hour of testing with interpretive exercises will produce fewer independ- 
ent scoring units than the same time devoted to independent test items. 
Partly because of this time factor and also because of the intrusion of gen- 
eral interpretive ability, the interpretive exercise is not an efficient measure 
of informational background as such. 


Suggestions for Writing 

1. Select the type of material to be interpreted for significance and rep-
resentativeness.

Attention should be given to various types of material, and the most 
suitable should be selected. In many respects the problem of selecting ma- 
terials to be interpreted is similar to the problem of identifying topics for 
independent items. The criterions of significance and the distribution of 
emphasis necessary to produce a good interpretive exercise vary from field 
to field. They can be determined only by one who is competent in the field.

2. Write or rewrite the material to be interpreted so as to provide for
the desired interpretations and to eliminate nonfunctional portions. 

It is rarely possible to find intact passages of material that are ideally 
suited for interpretive exercises. Most test passages must either be orig- 
inally written for the special purpose or developed through thorough re- 
vision of existing source materials. It goes without saying that the material 
should be clearly presented and should conform to the highest standards 
of form except where deliberate alterations are necessary to provide an 
opportunity for critical comments or for suggested revisions. Often the 
addition or elimination of a word or sentence will provide new opportuni- 
ties for significant questions. The task of producing an interpretive exercise 
is not simply one of finding suitable material and then writing questions 
on it. It is rather an integrated task in which preparation of the test 
exercises and modification of the material go hand in hand. 

The first step in constructing a test passage often consists of searching 
through some materials for a reading selection that seems to contain several 
promising possibilities for interpretive items. The next step is to construct 
tentative items that exploit all of the item possibilities which the passage 
presents. The third step is to retvrite the passage so as to eliminate from 
it anything that does not contribute to the items already built and that is 
not essential to the continuity of the passage. This condensation may re- 
quire some modification of the original items, or elimination of certain 
items entirely. A highly condensed version is thus produced that, with 
reference to the items already constructed, is far more efficient than the 
original passage, or that yields far more items "per line of passage" or per 
unit of total testing time. 

The next step is to reconsider this condensed version to determine 
whether, by further rewriting or additions, the basis for additional good 


items may be introduced. These changes may again require modification 
or even deletion of some of the original items. This process of reciprocal 
revisions and additions to the passages and the set of items continues until 
the writer feels that he has reached the point beyond which further im- 
provement is not worth the effort it costs. The version finally produced 
may sometimes have little resemblance to, or contain almost no intact 
paragraphs or sentences from, the original passage, but in all instances 
it is a much more efficient test exercise than the original. 

The preceding paragraphs describe roughly the manner in which most 
good interpretive exercises are built. Sometimes, an exceptionally competent 
test constructor may be able to write from scratch an original selection to 
satisfy certain predetermined specifications. Even then, the first draft of 
the passage will ordinarily require many revisions as the items are being 
constructed. In general, the construction of interpretive test exercises re- 
quires a much more opportunistic approach than is usually followed in test construction.

3. Construct the items in either multiple-choice or true- false form, with 
regard to all suggestions previously made for these forms. 

While the multiple-choice form is generally more adaptable to the ques- 
tions raised in interpretive exercises, there are situations in which the true- 
false item is entirely adequate and may even be preferred because of its 

4. Decide in advance how much emphasis should be placed upon a stu- 
dent's background information and then construct questions to provide
the desired emphasis. 

It has been pointed out previously that the interpretive exercise provides 
the opportunity for questions that make much or little demand upon the 
student's background of special information. This possibility should be 
clearly recognized, and the items written in terms of a definite purpose. 


Throughout this chapter attention has been called to the many subtleties 
involved in item writing, and to the high degree of skill needed by the item 
writer in dealing with these subtleties. It would be unfortunate if the net
effect of these comments were to discourage potential item writers from
undertaking the task. Awareness of ideals and high standards needs to be
balanced by a realistic view of the practical limitations in time and skill 
which often force the use of substandard tests. Such tests are often far better 
than no test at all. 


In view of the time required to produce good test items, there is a strong 
incentive for the exchange among writers of good items. A number of 
suggestions for cooperation along this line have been offered by leaders in
the field. It is to be hoped that a cooperative agency for the selection, classi- 
fication, and distribution of well-written test items may someday be established.

Selected References 

1. Adkins, Dorothy C. Construction and Analysis of Achievement Tests. Washington: 
Government Printing Office, 1947.

2. Engelhart, Max D. "Unique Types of Achievement Test Exercises," Psychometrika,
7: 103-16, 1942. 

3. Hawkes, Herbert E.; Lindquist, E. F.; and Mann, C. R. The Construction and Use
of Achievement Examinations. Boston: Houghton Mifflin Co., 1936. 

4. How to Make the Picture Test Item. (DA AGO PRT-873.) Washington: Department
of the Army, 1948. 

5. Mosier, Charles I.; Myers, M. C.; and Price, Helen G. "Suggestions for the Con-
struction of Multiple Choice Test Items," Educational and Psychological Measurement, 
5: 261-71, 1945. 

6. Ruch, G. M. The Objective or New-Type Examination. Chicago: Scott, Foresman and
Co., 1929. 

7. Scates, Douglas E. "Complexity of Test Items as a Factor in the Validity of Measure- 
ment," Journal of Educational Research, 30: 77-92, 1936. 

8. University of Chicago Board of Examinations. Manual of Examination Methods. 
2nd ed. Chicago: University of Chicago Bookstore, 1937. 

9. Weitzman, E., and McNamara, W. J. "Apt Use of the Inept Choice in Multiple
Choice Testing," Journal of Educational Research, 39: 517-22, 1946.

8. The Experimental Tryout of 
Test Materials¹

By Herbert S. Conrad 
U.S. Office of Education

Collaborators: Dorothy C. Adkins, University of North Carolina; 
Frederick B. Davis, Hunter College; Edith M. Huddleston, Educational
Testing Service; William G. Mollenkopf, Educational Testing Service;
William B. Schrader, Educational Testing Service

After a set of test items has been written, criticized by subject- 
matter experts, and revised on the basis of their criticisms, it must ordi- 
narily be tried out experimentally on a sample of examinees. This sample 
should be as nearly like the population with whom the final form of the 
test is to be used as is reasonably possible. 

The various purposes which may be served by a tryout are listed below 
roughly in the order of their practical importance in educational achieve- 
ment test construction. In practice, not all of these purposes, of course, are 
served by the tryouts for all tests. Some tryouts are planned with certain of 
these purposes in mind, others with other purposes. 

1. To identify weak or defective items and to reveal needed improve-
ments. More specifically, to identify ambiguous items, indeterminate
items, nonfunctioning or implausible distracters, overly difficult or overly
easy items, and so forth.

2. To determine the difficulty of each individual item, in order that a 
selection of items may be made that will show a distribution of item 
difficulties appropriate to the purpose of the finished test. 

3. To determine the discriminating power of each individual item, 
in order that all items selected may contribute to the central purpose of 
the finished test and together constitute an efficient measuring instrument. 

4. To provide data needed to determine how many items should 
constitute the finished test. 

5. To provide data needed to determine appropriate time limits for 
the finished test. 

¹ This chapter is a condensation and revision of an originally longer presentation. The ma-
terials in this chapter do not necessarily reflect the viewpoint or policy of the Office of
Education, Federal Security Agency.



6. To discover weaknesses or needed improvements in the mechanics 
of test taking, in the directions to examiner and examinee, in the pro- 
visions for the responses, in the sample or fore-exercises, in the typo- 
graphical format, and so forth. 

7. To determine the intercorrelations among the items, in order to 
avoid overlap in item selection and to know how best to organize the 
items into subtests. 
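Purposes 2 and 3 can be made concrete with a small computational sketch. The following fragment is our illustration only; neither the function name nor the 27 percent grouping convention comes from this chapter. It shows how item difficulty (the proportion of examinees passing each item) and a simple upper-versus-lower-group discrimination index might be computed from tryout responses.

```python
# Illustrative sketch (not from the chapter): item difficulty and an
# upper/lower-group discrimination index from a table of tryout responses.
# The 27 percent split is a common convention, assumed here.

def item_statistics(responses, fraction=0.27):
    """responses: list of lists; responses[p][i] is 1 if pupil p
    answered item i correctly, else 0."""
    n_pupils = len(responses)
    n_items = len(responses[0])
    totals = [sum(r) for r in responses]

    # Difficulty: proportion of all examinees passing each item.
    difficulty = [sum(r[i] for r in responses) / n_pupils
                  for i in range(n_items)]

    # Rank pupils by total score; take upper and lower groups.
    order = sorted(range(n_pupils), key=lambda p: totals[p])
    k = max(1, int(round(fraction * n_pupils)))
    lower, upper = order[:k], order[-k:]

    # Discrimination: proportion passing in the upper group
    # minus the proportion passing in the lower group.
    discrimination = [
        sum(responses[p][i] for p in upper) / k
        - sum(responses[p][i] for p in lower) / k
        for i in range(n_items)
    ]
    return difficulty, discrimination
```

An item everyone passes gets difficulty 1.0 and discrimination 0.0; an item passed only by high scorers discriminates strongly.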

The experimental tryout may be simple or it may be extremely elaborate 
and expensive. The nature of the tryout depends on the purposes for which 
the test is to be used, the nature of the items, and the amount of time, 
money, and resources available. In general, the tryout can be relatively 
simple for sets of items designed to provide parallel forms of tests already 
in use. In such cases no tryout of general format or of administrative pro- 
cedures is necessary, and the content and difficulty of the items that have 
been found satisfactory in previous forms can be duplicated rather readily. 
If, however, a new kind of test is to be constructed for an important use, 
an elaborate tryout or series of tryouts must be planned. For example, if 
novel types of test items were to be utilized for determining eligibility for 
teaching positions in several large cities, it is obvious that one would want 
to be very sure of the adequacy of the items before they were administered 
to the candidates. 

Tryout Stages 

For purposes of discussion in this chapter, it may be helpful to break the 
process of experimental tryout into three stages: the pretryout, the tryout 
proper, and the final trial administration.

Pretryout
Before the items may be tried out, they must, of course, be assembled into 
tryout units, or tryout forms. By a pretryout is meant the preliminary 
administration of the tentative tryout units to small samples of examinees 
for the purpose of discovering gross deficiencies, but with no intention of 
analyzing pretryout data for individual items. The tentative tryout forms 
may be "dittoed," or may even be carbon copies of a typewritten draft. 
The pretryout may be highly informal, and may involve the administration 
of the tentative tryout units to from half a dozen to a hundred examinees 
fairly representative of the population to whom the finished test is to be 
administered. The sample may, in fact, consist merely of a few adults who 
try to "put themselves in the position" of the students for whom the test 
is intended. Such a sample is presumably better than none, though it 
certainly cannot be recommended as the best. 


Ordinarily, the test constructor will wish to administer the pretryout 
himself, since much may be learned by direct observation and by interviews 
with examinees that would be difficult to transmit in a written report 
by an assistant. Major omissions, ambiguities, or inadequacies in the 
directions to the examinee and in the sample items and fore-exercises may 
be discovered in the pretryout. Serious errors in the assembly of the test 
booklets or difficulties in the mechanics of test taking may likewise be re- 
vealed. The amount of time that should be allowed in the later tryout 
administrations of the units can be estimated with considerable accuracy, 
and this is often the major reason for a pretryout. Whether the average 
level of difficulty of the items is satisfactory can also be quickly determined 
within rough limits. If the need for a larger number of easy or difficult 
items is apparent, they can be prepared and incorporated in the revised 
tryout forms.

The Tryout Proper
Once the gross deficiencies in the tryout forms have been eliminated, 
perhaps on the basis of a pretryout, it becomes necessary to obtain accurate 
information concerning the performance of each item in a sample of 
examinees similar to those with whom the final form of the test is to be 
used. Administration of the tryout forms of the test for this purpose to 
400-500 or more examinees is regarded for the purposes of this chapter as 
constituting the tryout proper. Sometimes, the results of the first tryout 
reveal the desirability of altering some items and of replacing others. In 
these circumstances, a second tryout may be needed to ascertain the adequacy 
of the revised set of items. More than two tryouts may sometimes be deemed 
advisable, particularly if the tests are to be used to make important deci- 
sions about the examinees, and if time and resources are such as to permit 
getting proof in advance that the final form of the test will be highly effi- 
cient as well as valid. However, only very rarely is more than one tryout 
feasible in practical test construction. 

Trial Administration of the Finished Test 

On the basis of the data obtained in the tryout, the items are selected and 
assembled into the finished test. The finished test is then ready for a trial
administration.
The trial administration, as it is defined in this chapter, serves to indicate 
exactly how the test will function in actual use. This means that no material
changes can be made after the trial administration, and that the sample 
employed must be essentially like the group with whom the test is to be 


used. The trial administration is tlius a "dress rehearsal," and provides a 
final check on time limits and on the procedure of administration. 

This definition means that in many instances the first practical use of the 
finished test is actually the trial administration. 

Planning the Tryout 


Unless tryout data are based on samples of examinees essentially similar 
to those with whom the test is to be used, the data may be misleading 
rather than helpful. The indices of difficulty and of discriminating power
of the items, the attractiveness of the distracters, and the magnitudes of
reliability and validity coefficients for the tryout forms are all dependent
on the characteristics of the sample of examinees tested. The test construc-
tor should try not only to obtain representative samples in order to avoid
bias in his data, but he should also try to obtain samples that are efficient
in the sense that they yield maximum information about the population per 
individual tested. 

It is a common mistake to judge the adequacy of a tryout sample solely in 
terms of the number of pupils tested. However, the school, as well as the 
pupil, must be taken into account. Differences in mean test achievement 
from school to school are usually far larger than would result from simple 
random sampling, and are often almost as great as differences among 
individual pupils within a single school. The same item may prove very 
difficult in one school and very easy in another, or very discriminating in 
one school and nondiscriminating in another. Even that which is measured 
by an item may differ from school to school; the same item may prove to 
be a reasoning item for pupils with one set of learning experiences, and a 
memory item for those who have had other learning experiences. For these 
reasons, the number of schools represented in the tryout is important, as 
well as the number of pupils. A tryout sample of 20,000 pupils all taken 
from the same school system will generally not serve as well as a sample 
of 400 pupils taken from many school systems. Theoretically and actually, 
a sample consisting, for example, of every twentieth child in every sixth 
grade in a state is vastly better than one of the same size consisting of every 
sixth-grade child in each one of several schools scattered throughout the 
state. Articles by Lindquist (3) and Marks (4) deal with this point. 

Practical considerations, of course, play a major part in obtaining samples. 
Schools and classes within schools constitute natural administrative units, 
and it is, therefore, ordinarily impossible to test only every tenth pupil 


or only every twentieth pupil. This means that if a certain class is to be 
used at all, every pupil in that class usually has to be tested, even though 
the resulting sample is much larger than is required. Sometimes, for the 
purpose of computing tryout data, a random group may be selected from 
among those tested, thus saving unnecessary clerical labor; but ordinarily a 
score for each pupil must be obtained for reporting to the cooperating
schools.

Administrative Arrangements 

In general, the administrative factors to be considered in the tryout of 
the test are the same as those to be considered in the administration of the 
finished test in practical use. A thorough treatment of these factors will be 
presented in chapter 10 "Administering and Scoring the Test," and hence 
only a brief discussion of administrative arrangements is necessary here. 
The present discussion is, therefore, to be regarded only as a summary of
the most important points, and any reader who is seeking help in the planning
of an actual tryout should rely primarily on chapter 10 so far as administra-
tive arrangements are concerned. 

The most important single principle to follow in planning the tryout 
is that the tryout materials should be administered under as nearly as pos- 
sible the same conditions as those under which the finished test will be 
administered. This principle should be borne in mind in relation to each 
of the following paragraphs. 

Distribution of tryout materials to participating schools. — All materials
needed in the schools participating in the tryout should be distributed to 
them well in advance of the date of the tryout administration, so that 
adequate opportunity is given for the correction of shipping errors, answer- 
ing of questions concerning administrative arrangements, and so forth. 

Examiners. — Since the care and skill with which tests are administered 
depend on the test administrators, it is obvious that a good deal of thought 
should be exercised to secure competent examiners who will conscientiously 
carry out the directions provided. Ordinarily, the purpose of the testing and 
the basic rationale of the testing procedures should be explained to each 
examiner well in advance of the tryout. 

Directions to examiner. — The directions supplied to each examiner
should be highly detailed and crystal clear. They should define the duties of 
proctors and examiners and include specific instructions regarding the dis- 
position or return of test materials. If sample exercises of any kind are 
included in the tryout materials, the examiner should be provided with the 
answers to such exercises to avoid the embarrassment or the errors that may 


occur as a result of haste or unpreparedness when examinees ask for the 
correct answers. The necessity for adhering rigidly to time limits should
naturally be emphasized. 

Directions to examinees. — The directions to the examinees in the tryout 
forms should be as nearly as possible identical with those that are to be 
used with the finished test. If there is any doubt about the adequacy of the 
directions, they should be revised on the basis of a pretryout, rather than 
after the tryout. 

There is some disagreement among psychologists and educators, with 
reference to the regular use of finished tests, as to whether the examinees 
should be told to guess or not to guess on multiple-choice items for which 
they are not sure they know the correct answers. To a lesser extent this 
disagreement also prevails with reference to tryout forms. 

Cronbach and others ( 1 ) have pointed out that the extent to which pupils 
guess on multiple-choice tests depends not only on the level of their know- 
ledge but also to a considerable extent on personality factors which we do 
not ordinarily wish to include in our measurement. This is no doubt true, 
and has been cited as a reason for directing every examinee to mark every 
item in the tryout form, even though the time allowed is inadequate for 
careful consideration of each item. It should be pointed out, however, that 
some pupils react strongly to being forced to mark answers to items to 
which they have no idea of the correct answer, and some teachers object to 
having their pupils so directed. Thus, personality factors are apt to influence 
test scores to some extent whether the directions do or do not require the 
examinees to answer every item. 

Many psychologists recommend directions that ask the pupils to guess 
only when they have some clues and can make "shrewd guesses"; with 
such directions, a recommendation is also frequently made in favor of a 
scoring formula that corrects for chance success. As pointed out in chapter 
10, the assumptions underlying the correction for chance success are 
rarely, if indeed ever, exactly satisfied; hence, this procedure has been 
criticized, particularly in tryout situations. There is nearly universal agreement,
however, that examinees should be told clearly in the directions
for a tryout test whether to answer every item or to guess shrewdly and 
whether the scoring formula includes a penalty for guessing incorrectly. 
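The scoring formula usually meant by "correction for chance success" is given below in its standard textbook form (S = R − W/(k − 1)), not as quoted from chapter 10; it assumes that every wrong answer represents a blind guess among all k choices, and omitted items are not counted as wrong.

```python
# Conventional correction-for-chance scoring: S = R - W/(k - 1),
# where R is the number right, W the number wrong, and k the number
# of choices per item. Omitted items count neither way.

def corrected_score(right, wrong, choices):
    """Score corrected for chance success on items with
    `choices` options each."""
    return right - wrong / (choices - 1)
```

For example, 40 right and 12 wrong on five-choice items yields 40 − 12/4 = 37.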

Sample items. — Consideration should always be given to illustrative or 
sample items included in the directions. If the items in a test are of a kind 
familiar to the examinees, sample questions may be superfluous unless a 
"warm-up" exercise is judged essential. Sample questions or practice 
exercises may become of special value if it is desirable to control the "set" 


of the examinees with regard to speed of response. If an examinee is pre- 
sented with ten sample items and he finds that he has answered all ten 
of them in half the time allowed, he learns better than any set of verbal 
directions could make him learn that he can afford to work more slowly 
and carefully. Not all test constructors agree that sample items should be 
used for this purpose. Sometimes, unfortunately, the "set" appropriate 
for the sample items is entirely inappropriate for the content of the test 
itself, as examinees have sometimes discovered to their dismay. This situa- 
tion should be avoided. 

Reports on test administration. — Provision should always be made to 
secure from the tryout test administrators complete reports on the condi- 
tions that prevailed during the testing. Full information should be pro- 
vided regarding any interruptions or irregularities that may have occurred 
or difficulties that may have been encountered. 

Security. — The importance of making certain that every copy of a tryout 
test has been returned or accounted for varies greatly with the nature and 
purpose of the test. Sometimes, each copy of a tryout test should be num- 
bered, and test administrators should have to sign receipts for specified 
numbers of copies. This makes it possible to hold each administrator re- 
sponsible for returning or accounting for the disposition of every copy 
in his possession. Sometimes, it is desirable to issue instructions to examiners 
to inspect each test booklet when it is returned to be sure that all pages 
remain intact. Other security measures, such as the use of sealed packages, 
can be employed as needed. 

Pupil motivation. — The level of pupil motivation is always an important
consideration in the administration of tests, but the problem may be 
particularly troublesome with tryout tests. If the examinees are told that 
the results are to be used only for experimental purposes, their motivation 
may be adversely affected. On the other hand, it may sometimes be unfair 
to use the results of imperfect tryout forms to judge the pupils' perform- 
ance. A practical compromise may consist of telling the examinees that 
the test results will be included in their records for such use as seems
appropriate.

Reports to cooperating institutions. — Almost invariably the scores of all
examinees should be reported to them or to their teachers within a reason- 
able period of time. Knowledge that this will be done is an important 
factor in securing good motivation among the group tested and a high 
degree of cooperation among the teachers and administrators who have 
arranged for the testing. Effective means for providing prompt, convenient 
reports of the results of testing should always be worked out in advance 


and adhered to with scrupulous care. Too often, such reports are neglected 
or are provided only after interest in and uses for the scores have long 
since disappeared. 

Securing reactions of examinees. — Provision should be made for noting 
down comments by examinees regarding the directions and content of 
the pretryout and the tryout forms, and such comment should be carefully 
considered in revising the test. Inadequate directions and faulty items are 
often identified in this way. 

Undesirable attitudes toward schools and especially toward examinations 
have undoubtedly been engendered in students by correctible faults in 
examinations or in the administrative procedures connected with them. 
Unsophisticated test constructors sometimes tend to discount criticisms of 
examinees (and even of subject-matter authorities) when statistical data 
indicate no apparent basis for their criticisms. This is an unfortunate tend- 
ency because the criteria for item analysis typically leave much to be de- 
sired. Although item analysis data are often exceedingly helpful in test 
construction, they cannot and must not be relied upon to detect all the 
ambiguities and other faults of items. For various reasons, then, test con-
structors should accept with proper humility the comments of the examinees.

Scoring procedure. — The scoring procedures for use in each tryout stage 
and with the final form of a test should be worked out carefully before 
the tryout forms are reproduced because the requirements of quick and 
easy scoring sometimes influence the arrangement of items. For example, 
forethought in planning tryout tests may make possible obtaining separate 
scores for as many as six groups of items from only one insertion of an 
answer sheet in the IBM test-scoring machine. 

Scoring formulas should be selected to yield maximum validity with 
minimum labor. They must ordinarily be coordinated with the directions 
and timing for the test; thus, if the examinees are told to guess, but are 
given insufficient time to answer all items, a correction for chance may be 
called for. Various scoring formulas are presented and discussed in chap- 
ter 10. Sometimes data from the trial administration should be used to 
calculate the relative weights for "right" and "wrong" answers that will 
cause scores to yield maximum correlations with designated criterion vari- 
ables (2, pp. 458-61). Convenient approximations to the optimal weights
can then be incorporated in the scoring formulas for the final form of 
the test. 

As stated in chapters 9 and 10, many test technicians recommend cor- 
rection for chance in scoring. Some test technicians recommend the use of 


correction for chance in obtaining the scores to be used as the criterion 
for internal-consistency item analyses even if the item analysis data for 
each item are not so corrected and if the final form of the test is not to 
be scored with correction for chance. This is particularly important if not 
every examinee in the tryout sample is able to reach nearly every item 
within the time limit. Correction for chance in obtaining criterion scores 
may sometimes remove contaminating influences (such as rate of work 
and some personality differences) from the criterion variable. 

Test Materials 

Layout. — Careful planning of the arrangement of items within a tryout 
test must precede reproduction of the copy. Many of the mechanical details 
pertaining to this problem are discussed in chapter 11. Here, we should 
consider some more fundamental decisions. 

Insuring adequate tryout of all items. — Since one of the purposes of a 
tryout is to gather accurate data about each individual item, it is extremely 
important that the time limits be so generous as to permit all, or nearly 
all, of the examinees to attempt to answer every item on which tryout data 
are desired. This poses a real problem when the items are to be used in the 
final form of a speeded test, that is, one so timed that a considerable 
proportion of examinees do not have time to finish. If the experimental 
tryout were conducted under speeded conditions, the items toward the 
end could not be attempted by a large proportion of the examinees; but 
if the tryout were conducted with generous time limits, the mental-set and 
rate of work of the examinees would be so different from those likely 
to obtain during administration of the final form of the test that the tryout 
data might not be useful. 

Four ways of avoiding this dilemma have been widely employed. Under 
the first method, the items to be tried out under speeded conditions are 
followed in the test booklet by a set of "cushion" items of the same gen- 
eral type as the others but that are neither to be analyzed nor scored. These 
items should be relatively difficult and time-consuming, since their only 
purpose is to keep the faster or abler examinees occupied during the latter 
part of the test period and thus prevent them from distracting other 
examinees or creating a disciplinary problem. Under the second method, 
the items to be tried out are placed in different orders in two or more 
tryout booklets. That is, if 30 items are to be tried out, items 11-20 in 
booklet A appear as items 1-10 in booklet B, and items 21-30 in booklet 
A appear as items 1-10 in booklet C. Thus, every item appears among 
the first ten in at least one of the tryout booklets, and item analysis data 


may be computed only for the first 20 items in each booklet. The second
method is essentially the same as the first — the cushion items in each 
form consisting of some of the items to be tried out. The second method 
is, of course, the more expensive, since it requires the printing of several 
tryout forms for the same items, and utilizes only a part of the total try- 
out sample on each item. 
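The block rotation of the second method can be sketched as follows. The helper below is our illustration, not part of the chapter, but it reproduces the booklet A/B/C arrangement described above for 30 items rotated in blocks of ten.

```python
# Sketch of the second method: rotate the tryout items left in steps
# of ten, so every item appears among the first ten positions in at
# least one booklet. The function name is ours, for illustration.

def rotated_booklets(items, block=10):
    """Return booklets made by rotating `items` left in steps of `block`."""
    n = len(items)
    return [items[s:] + items[:s] for s in range(0, n, block)]

items = list(range(1, 31))          # item numbers 1-30
booklets = rotated_booklets(items)  # booklets A, B, C
```

Booklet B then opens with items 11-20 and booklet C with items 21-30, exactly as in the text, so analysis of the first 20 positions in each booklet covers every item.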

Under the third method, only the items to be tried out constitute the 
tryout form(s), the items are administered with directions to guess 
shrewdly but to avoid blind guessing, and the time limit is set so that 
only a small proportion of the examinees will not have time to finish. An 
effort is then made to utilize the tryout data for the end items — those not 
attempted by all examinees — by making various corrections for guessing 
and for "non-attempts" in the difficulty and discriminating indices for 
these items. Corrections of this character are presented in chapter 9.

The fourth method is especially applicable when items designed for 
parallel forms of tests administered as part of an organized program are 
to be tried out. It consists of interspersing new items in current final forms 
or of preparing special tryout booklets for administration along with cur- 
rent final forms. In the National Teacher Examinations, for example, the 
vocabulary section originally consisted of 90 items of which 10 (comprising 
one column on one page) were tryout items that were not scored with 
the remaining 80. Instead, scores on the 80-item current form served as a 
criterion variable for item analysis of each of the 10 tryout items. This 
procedure has the merit of removing the spurious element ordinarily 
present in internal-consistency item analysis data. By making up 10 differ- 
ent sets of tryout items and running each of them in for one-tenth of the 
total printing, 100 items were tried out each year to provide an 80-item 
final form for the following year. This is obviously an extremely expen- 
sive procedure, and for that reason is rarely practicable. 

Most tests of educational achievement are not highly speeded, but are 
administered with liberal time limits so as to place the major emphasis on 
"level" or "power" rather than on rate of work. In general, therefore, 
the best procedure is to provide very liberal time limits for the tryout units,
so that nearly all pupils have time to consider (not necessarily to mark) 
all items that are being tried out, and then to use one of the first two 
methods just described to provide for the small proportion of pupils who 
do not finish the test. In most cases, the first and simpler of these methods 
should prove adequate. 

Provision for criterion measures. — A basic distinction in types of try- 


outs is that concerned with the provisions for a criterion in terms of which
the individual items are to be judged. In general, the best educational 
achievement tests themselves constitute the best available definition of the 
type of behavior being measured, and no external criterion currently exists 
that measures this behavior better than the score on the test itself (see 
pages 146-48). The items must then be evaluated against an internal 
criterion, consisting either of the score on the tryout form itself, or the 
score on an independently administered test like that in which the items 
being tried out are to be employed. Suppose, for example, that new items 
are being tried out for the construction of a new form C of a test for 
which forms A and B are already available. One possibility is to administer 
the new items in one or more tryout forms, each form being administered 
to a different sample, and to use the total score on each tryout form as the 
criterion in terms of which to compute the discriminating power of each 
item in that form. The other possibility is to administer the tryout items 
in one or more tryout forms, each to a different sample, to administer 
either form A or B to all samples, and then to use the score on form A 
(or B) as the criterion in terms of which to judge all tryout items. 

If the first alternative is employed, it is essential that each tryout form 
be long enough to yield a reliable internal criterion. Sometimes it may be 
possible to administer all items to be tried out in a single tryout form, 
so that all items are taken by the same examinees. This, of course, insures 
the maximum reliability in the internal criterion, and also makes it pos- 
sible, if desired, to compute intercorrelations among all of the items. 
However, it is frequently difficult to induce cooperating schools to ad- 
minister large tryout units, or it may be otherwise undesirable to do so, 
for example, because of fatigue or motivation factors. 

If the second alternative is followed, the tryout forms may be of any 
length, and administrative convenience is usually the major factor in de- 
termining their length. Consideration should be given, however, to the 
possibility that the tryout forms may be so short in comparison with the 
finished test that the attitude of the examinees toward the tryout test (motivation
and fatigue factors) may differ too markedly from that which will
characterize the finished test. Furthermore, the use of many short tryout 
units independently administered to different samples precludes the pos- 
sibility of securing item intercorrelation data. 

A good example of the use of the second of these alternative procedures 
is provided by the Iowa Basic Skills testing program for grades 3-8. In 
this program, an entirely new edition of tests has been provided each year, 
and the tryout of new test materials for the succeeding program has been 


made an integral feature of each annual testing program. This has been 
done by assembling the tryout items each year into self-administering 
experimental units, each of which constitutes a miniature power test with 
a time limit of twenty minutes, and each of which corresponds to one of 
the tests in the regular battery being concurrently administered. These 
units are distributed in rotation so that every pupil in a class of 50 may
be taking a different experimental unit. In a testing program involving
70,000 pupils, it is possible to secure 140 samples of 500 pupils each and, 
correspondingly, to try out 140 twenty-minute units in these samples. To 
try out all of these items with a single group of pupils would require over 
40 hours of their time. Furthermore, the 500 pupils in each sample are 
distributed proportionately among all schools participating in the program, 
so each sample is apt to be highly representative of the entire group and 
thus similar to all other samples. 
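The arithmetic of this rotated distribution is easy to verify. The sketch below is our own (the function name is not the Iowa program's); it deals the 140 experimental units out in sequence, so that each unit reaches 70,000 / 140 = 500 pupils drawn from every participating class.

```python
# Illustrative sketch of rotated distribution: unit u = pupil index
# mod n_units, cycling through the units within every class and school,
# so each unit's sample is spread proportionately across all schools.

def assign_units(n_pupils, n_units):
    """Assign experimental units to pupils in strict rotation."""
    return [p % n_units for p in range(n_pupils)]

assignments = assign_units(70_000, 140)  # 140 units, 500 pupils each
```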

The criterion measure for each experimental unit, of course, is the score 
in the corresponding test in the finished battery being used in the regular 
program. The criterion, then, is undiluted by weak or defective tryout 
items, and there is no issue of self-correlation so far as the tryout items 
are concerned. This method minimizes the difficulty of getting examinees 
for experimental purposes, eliminates the need for special reports to co- 
operating institutions, and maximizes the representativeness of the tryout 
samples. Every effort should be made to employ such methods whenever 
they are feasible. 

When the items to be tried out are organized into a number of tryout 
forms, each of which is to be administered to a separate sample, it is ex- 
tremely important that highly comparable samples be employed. This will 
almost certainly not be true if the different samples are made up of differ- 
ent classes and schools. The only satisfactory procedure, generally, is to 
administer the various forms simultaneously to different randomly selected 
pupils (or, better, matched pupils) in the same class or group, so that all 
classes and schools are proportionally represented in each sample. 

Order of difficulty of ite?ns. — To avoid any possibility of confusing or 
discouraging the examinees early in the tryout period, easy items or familiar 
types should be placed near the beginning of a test, and unfamiliar, un- 
usual, difficult, or particularly complex types of items should be placed 
at the end of a tryout test. 

Surplus items for discard. — It is always necessary to try out more items 
than will be needed for the finished tests. There is no universally applic- 
able rule regarding the margin for discard that must be allowed when a 
set of items is tried out. It is safe to say that the margin can be minimized 


when the items are constructed to provide another form of an existing test, 
when the items are factual in content or are concerned with narrow and 
well-defined skills, when the standard of excellence to be met is not 
specially high, and when the tryout sample is fairly large and highly 
representative of the group to which the final form of the test will be 
administered. For vocabulary items tried out in the National Teacher 
Examinations, a margin for discard of 20 percent was more than adequate. 
In fact, it was unusual for any vocabulary item in the tryout test to prove 
unsatisfactory for use in the final form of the test. 

In general, the more complex the items or the mental functions tested 
by them, the larger the margin for discard must be. The margin must also 
be increased as the number of separate categories of items is increased. 
This is a consequence of the fact that the percentage of acceptable items 
within a given category is subject to sampling fluctuations. Suppose that 
the outline for a history test calls for four items in the category of "names 
in the news." Even if in the long run about two-thirds of items of this 
kind that are tried out prove acceptable, it would not be prudent to try 
out only six items in this category. By mere chance it might be that only 
two or three of them would prove acceptable. A smaller margin of discard 
is adequate in the case of categories in the outline that call for a consider- 
ably larger number of items because the effect of chance will be cor- 
respondingly reduced. 
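The effect of chance on a small category can be sketched with a simple binomial model. The acceptance probability, the category size, and the function below are illustrative assumptions rather than figures from the text:

```python
from math import comb

def p_shortfall(needed, tried, p_accept):
    """Probability that fewer than `needed` of the `tried` items survive
    the tryout, if each item survives independently with probability
    p_accept (a simple binomial model of the discard process)."""
    return sum(comb(tried, k) * p_accept**k * (1 - p_accept)**(tried - k)
               for k in range(needed))

# Four "names in the news" items are needed, and about two-thirds of
# such items typically prove acceptable.
print(p_shortfall(4, 6, 2/3))   # trying only six items
print(p_shortfall(4, 9, 2/3))   # a much larger margin for discard
```

Under these assumed figures the risk of falling short is roughly 32 percent when only six items are tried but about 4 percent when nine are tried, which is the sense in which small categories require a proportionally larger margin.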

The principle stated above also applies to item categories in terms of 
difficulty. Ordinarily, only a small proportion of the items in the final 
form of a test are intended to be exceptionally easy or difficult. For this 
reason, we are likely to include only a small proportion of exceptionally 
easy or difficult items in our tryout test. Thus, we are dealing with cate- 
gories that include only a few items, and we should allow for a larger 
margin of discard. One reason for the noticeably high mortality of very 
difficult items is the greater role of chance in determining responses to 
these items. There is naturally a greater amount of guessing on such items 
than on easy items. 

The excess of tryout items over items needed for the finished test also 
depends on the competence of the item writer, or upon the quality of the 
tryout items. Of two item writers constructing items for the same test, 
one may produce a much larger proportion of defective items than the 
other. The individual planning the tryout must, therefore, depend to a 
considerable extent on his own evaluation of the items in deciding how 
many to try out. 

Organization into subtests. — Because the criterion variable for an in- 


ternal-consistency item analysis should ordinarily be as homogeneous as 
possible, items constructed for an achievement test are sometimes tried out 
in separate groups, the total score on each group being used as the criterion 
for obtaining item analysis data for each item within the group. In con- 
structing a test of college physics, for example, one might try out separate 
groups of items in mechanics, heat, electricity, and so forth. This is usually 
desirable even though all types of items are later to be intermingled in 
the finished test and all contribute to a single score. It is obviously im- 
portant that each group be large enough to yield a reliable criterion score, 
and that all groups be tried out on highly comparable samples. 
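As a sketch of how the group total serves as the internal criterion, the following computes the correlation of each item with the score on its own group. The response matrix is invented for illustration; note that, unlike the operational-battery method described earlier, each item here is still included in its own criterion total:

```python
from statistics import mean, pstdev

# Hypothetical tryout responses for one homogeneous group of items
# (say, the mechanics items): rows are examinees, 1 = right, 0 = wrong.
responses = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
]

totals = [sum(row) for row in responses]   # criterion: score on the group

def item_criterion_r(item_index):
    """Pearson r between one item (scored 0/1) and the group total."""
    x = [row[item_index] for row in responses]
    mx, mt = mean(x), mean(totals)
    cov = mean((a - mx) * (b - mt) for a, b in zip(x, totals))
    return cov / (pstdev(x) * pstdev(totals))

for i in range(len(responses[0])):
    print(i, round(item_criterion_r(i), 2))
```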

Data Obtained through the Tryout 

Concerning the mechanics of test taking. — Information regarding the 
adequacy of directions to the examiner or examinee is obtained mainly 
from the reports of the test administrators in the pretryout and tryout, 
preferably the former. Sometimes the directions for the pretryout form are 
so incomplete or misleading as seriously to affect the examinees' scores. 
Blank answer sheets, indicating failure to mark answers to any items, point 
either to inadequate directions or to lack of moderately easy items. Failure 
of examinees to go beyond a certain item may be found to result from lack 
of directions to turn the page or to go on to the next part, and the like. 

Reports from test administrators of the pretryout (and tryout) likewise 
provide most of the data regarding the suitability of practice exercises or 
sample items. Revision of the latter may be required to improve their 
length or level of difficulty. Adjustment of the time limits for practice 
exercises is commonly made on the basis of data obtained in the pretryout. 

Examiners and proctors at tryout administrations should observe care- 
fully the use made of scratch paper, when provided. Scratch paper should 
always be collected with the answer sheets if test security is to be assured 
and should ordinarily be inspected later in order to find what use was made 
of it. The inspection of test booklets, which is essential if they are to be 
reused, often reveals a need for scratch paper when it has not been provided 
(the margins and other spaces being found to have been filled with jottings). 

If the advice concerning test format that is provided in chapter 11 is 
followed with the tryout forms, there will be little need to modify the format 
of the finished test on the basis of tryout data. (When any significant 
change in format is made, a subsequent tryout in the new format is usually 
necessary.) The examiners and proctors should always be alert, however, 
to detect instances where the test exercises have been inconveniently ar- 
ranged or illegibly reproduced. Sometimes, maps or passages necessary 


to answering certain items may be placed in such a way as to require 
constant turning back and forth of pages. Inconveniences of this sort 
should be eliminated in final printings. 

Concerning time requirements. — In general, a test constructor either (1) 
builds his test to satisfy certain standards of reliability or validity, or both, 
or (2) he constructs a test to be administered in a predetermined time 
limit, such as a forty-minute class period. In the first case, he must first 
determine from the tryout what is the reliability or validity of the tryout 
form. Given this information, by means of the formulas for estimating the 
reliability and validity of lengthened tests, he can estimate roughly how 
many items like those in the tryout form will be needed to yield the desired 
reliability or validity. Then, having also determined from the tryout how 
much time per item on the average is required by the examinees, he can 
compute what time limit is appropriate for the required number of items 
(subject to later check in the trial administration). In the second case, in 
which a predetermined time interval is to be employed, he must again 
determine from the tryout what is the average time per item that is re- 
quired, and can then determine how many items similar to those in the 
tryout can be administered in the allotted time. 

In either case the test constructor must determine from the tryout what 
is the average time per item required by the examinees. In general, the 
best way to do this is to interrupt the examinees at certain times during the 
tryout period, and request them to mark in a characteristic way the item 
on which each is working at that instant, having first directed them to 
answer all questions in the order in which they are printed. For example, 
in a thirty-minute tryout period, the examiner may interrupt at the end of 
fifteen minutes to say, "On your answer sheet, please draw a circle around 
the number of the item on which you are now working." Five minutes 
later they may be asked to draw a triangle around the number of the item 
on which they are then working, and five minutes later to draw a square 
around it. It will then be easy, later, to compute the average number of 
items considered in fifteen minutes, or in twenty minutes, and so forth, 
or to compute any percentile in the distribution of numbers of items con- 
sidered in any given interval. From such data, the test constructor may 
readily determine how many items similar to those in the tryout will be 
completed by any desired proportion of the examinees in any given time. 
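A sketch of such a computation, using wholly hypothetical checkpoint data for ten examinees, might run:

```python
from statistics import mean, quantiles

# Hypothetical checkpoint data: the item number each of ten examinees
# had reached at the 15-, 20-, and 25-minute interruptions.
checkpoints = {
    15: [22, 18, 25, 20, 17, 23, 19, 21, 16, 24],
    20: [30, 24, 33, 27, 23, 31, 26, 28, 21, 32],
    25: [38, 30, 41, 34, 29, 39, 32, 35, 27, 40],
}

for minutes, reached in checkpoints.items():
    print(f"{minutes} min: mean item reached = {mean(reached):.1f}")

# Rate of the slowest decile (so that about 90 percent finish): the
# 10th percentile of items reached at 25 minutes, scaled to 40 minutes.
tenth = quantiles(checkpoints[25], n=10)[0]
print(f"items about 90 percent can attempt in 40 minutes: {tenth * 40 / 25:.0f}")
```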

Inasmuch as the finished form consists of selected items from the try- 
out form, and since the selected items may differ systematically in their 
time requirements from those discarded, it is often desirable to determine 


the final time limits in a trial administration of the finished test, using 
the procedure just described. Where an external criterion is available, 
there are still better methods for determining an optimum time limit for 
the finished test (see pages 336-38). The factors to be considered in 
determining time limits are more thoroughly discussed in chapter 10. 

Concerning reliability and validity. — It is often desirable to compute 
the coefficients of reliability (and of validity, if possible) for the tryout 
forms, particularly if the finished test is to satisfy certain standards of 
reliability (or validity). Appropriate methods of computing these coeffi- 
cients are presented in chapters 15 and 16. Given the reliability (or valid- 
ity) coefficient of the tryout forms, appropriate formulas (pages 581 and 
608) may be employed to estimate the reliability (or validity) of the finished 
test, if the number of items is predetermined, or to estimate how many 
items will be required in the finished test to produce the desired reliability 
(or validity). Since the finished test will be made up of selected items, 
which presumably are more reliable and valid than the typical item in the 
tryout form, these estimates will usually be biased, but in the direction 
of conservativeness. 
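For the reliability half of such an estimate, the Spearman-Brown formula can be applied directly; an analogous formula serves for validity. The tryout reliability, test length, and target below are hypothetical:

```python
def lengthened_reliability(r, k):
    """Spearman-Brown: reliability of a test k times as long as one
    whose reliability is r, assuming the added items are parallel."""
    return k * r / (1 + (k - 1) * r)

def length_factor_needed(r, r_target):
    """Spearman-Brown solved for k: how many times longer the test
    must be made to reach the target reliability."""
    return r_target * (1 - r) / (r * (1 - r_target))

# A 40-item tryout form with reliability .75, to be lengthened until
# the reliability reaches .90.
k = length_factor_needed(0.75, 0.90)
print(round(40 * k))
```

Here the test must be tripled to 120 items; as the text notes, because the finished form keeps only the better items, such an estimate is usually conservative.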

Concerning difficulty and discriminating power of individual items. — 
The major reason for the experimental tryout is to secure quantitative 
measures or indices of the difficulty and discriminating power of individ- 
ual items. Appropriate indices and ways of computing them, as well as 
suggestions for their use in test revision, are presented in the following chapter. 

Selected References 

1. Cronbach, L. J. "Response Sets and Test Validity," Educational and Psychological 
Measurement. 6: 475-94, 1946. 

2. Guilford, J. P., and Lacey, J. I. (eds.). Printed Classification Tests. Washington: 
Government Printing Office, 1947. 

3. Lindquist, E. F. "Sampling in Educational Research," Journal of Educational Psy- 
chology, 31: 561-74, 1940. 

4. Marks, Elias. "Sampling in the Revision of the Stanford-Binet Scale," Journal of Edu- 
cational Psychology, 44: 413-34, 1947. 

9. Item Selection Techniques 

By Frederick B. Davis 
Hunter College 

Collaborators: Dorothy C. Adkins, University of North Carolina; 
Herbert S. Conrad, U.S. Office of Education; Edward E. Cureton, 
University of Tennessee ; William G. Mollenkopf, Educational Testing 
Service; Julian C. Stanley, George Peabody College for Teachers 


building an objective test is to try out more items than will be needed in the 
final form. Ingenious statistical techniques to aid in selecting the items for 
the final form have been developed, but the precise circumstances in which 
these techniques are useful are not widely understood and they have some- 
times been misused. A major emphasis in this chapter will, therefore, be 
placed on when and how to use them. 

By the time the tryout forms for a test have been constructed, much of 
the opportunity for selecting items has already passed. The first and 
fundamentally most important steps in selecting the items for almost any 
test are taken when an outline is constructed to determine its content and 
the items are prepared to measure the skills and abilities included. This 
outline should ordinarily be constructed on the basis of the judgment of a 
group of curriculum authorities and subject-matter experts and should 
indicate the weight to be given to each topic. The writing and editing of 
test items call for a rare combination of psychological insight, facility in 
written expression, and knowledge of the subject. Because item writing 
and editing are difficult and the mechanical use of item analysis data is 
relatively easy, there has been an unfortunate tendency to minimize the 
former and rely heavily on the latter. It cannot be emphasized too strongly 
that statistical data are at best merely a valuable guide in putting a test 
together and cannot take the place of scholarship, ingenuity, and painstak- 
ing effort on the part of the item writer. 

Selecting Items after the Experimental Tryout 

The difficulty of individual test items is not an important consideration 
when items are selected for mastery tests or, under some circumstances, 
for predictor tests. Mastery tests are not intended to provide scores that 



will rank students in terms of their knowledge or ability; rather they are 
designed to separate students into two groups: those who know certain 
basic facts, principles, or operations, and those who do not know them. 
For this reason, the items in a mastery test are so chosen that nearly every 
pupil who has reached a predetermined level of achievement can answer 
them all correctly. 

Predictor tests are constructed in such a way as to maximize their corre- 
lations with specified criterion variables. Ideally, if samples large enough 
and representative enough could be used, product-moment correlation coeffi- 
cients among all the items and between each item and the criterion would 
be computed and items would be selected for use in the final form of the 
test if they had multiple regression weights significantly different from 
zero at a specified level of confidence. Under these circumstances, the 
difficulty of each individual item would not have to be considered separately 
since it would automatically play its part in determining the item correla- 
tions. In actual practice, however, samples large enough to establish the 
existence of non-chance unique variance in every one or in nearly every 
one of a set of test items (assuming that it actually exists) cannot be used. 
Hence, this procedure is rarely defensible, and methods of item selection 
that involve the separate use of item difficulty indices should be employed 
in the construction of predictor tests. 

However, most educational tests are neither mastery tests nor predictor 
tests, and the use of item difficulty indices is a matter of considerable 
importance in their construction. For this reason, a discussion of the various 
methods used to determine item difficulty is appropriate at this point. 

Indices of Item Difficulty 

Many ways of expressing the difficulty level of an item have been pro- 
posed. The most obvious of these is the percent of the tryout group that 
marks it correctly. A serious disadvantage of this procedure is that it 
makes no allowance for the fact that, when the items are of the multiple- 
choice type, some examinees may guess blindly among the choices pre- 
sented and mark the correct answer by chance alone. When the number of 
choices in the item is reduced to the minimum of two, this disadvantage 
becomes more serious. On a two-choice item, an examinee has a fifty-fifty 
chance of answering correctly even if he guesses blindly. To cope with 
this troublesome problem, a correction for chance has been proposed.^ 
There is, however, rather sharp disagreement among test technicians regarding 
its appropriateness. For this reason, both points of view will be 
presented and the reader may form his own judgment. 

^ For a derivation of the formula commonly used to correct for chance success see 
"Selected References," p. 325, item 33. 

Correction for Chance Success in Computing 
Item Analysis Data 

The principal assumption involved in making use of the conventional 
correction for chance success is that examinees who do not possess enough 
knowledge to permit them to select the correct answer will guess blindly 
among all the choices (correct and incorrect) in the item. The extent to 
which this assumption is satisfied in actual practice varies with the nature 
of the items and of the groups tested. To the extent that examinees choose 
incorrect answers on the basis of misinformation rather than on the basis 
of blind guessing, the procedure overcorrects for chance success; to the 
extent that they can eliminate from consideration incorrect choices on the 
basis of partial information that is correct but not adequate to permit 
identification of the correct choice, the procedure undercorrects for chance success. 

Most test constructors agree that few examinees actually guess blindly 
among all of the choices in any item they have time to consider. Some of 
the examinees will ordinarily select an incorrect choice in the belief that 
it is actually correct. They are acting on the basis of misinformation. Others, 
probably often constituting a majority of those who mark the item incor- 
rectly, have a certain amount of valid information about the point being 
tested and can eliminate one or more of the incorrect choices from con- 
sideration. The more perceptive of them apply every kind of criterion to 
the remaining choices. "What does the examiner want me to answer? What 
clues to the right answer are there?" are typical of the questions examinees 
try to answer in order to mark the correct choice. Finally, a preference 
among the choices that cannot be eliminated from consideration is often 
made on the basis of some factor such as the relative lengths of the choices, 
or the precision of the language in which they are expressed, or the dis- 
tribution of responses among the choices in the items already answered. 
In a well-constructed test the relationship between these irrelevant con- 
siderations and the correct answers to the items is zero, or even slightly 
negative. Hence, the examinees will do no better than chance would per- 
mit when they make use of these non-chance but irrelevant considerations. 

This analysis of the reactions of examinees who cannot identify the 
answers to test items is based on observation, on introspections reported 
by examinees, and on certain observable test results. It is, of course, over- 
simplified; in practice, the factors that underlie the behavior of examinees 


are woven into complex patterns. Among the observable test results that 
offer clues to the reasons underlying the responses to items are the nega- 
tive percentages sometimes found after correction for chance success and 
the great differences in the attractiveness of the distracters in most items. 
Almost any extensive file of item analysis data contains items that are an- 
swered correctly by less than the most probable proportion of the examinees 
that would answer correctly by chance alone. If the number of items an- 
swered correctly by less than the most probable proportion of the examinees 
is significantly larger (at, say, the 5-percent level of confidence) than 
would appear by chance, this constitutes preliminary evidence that the 
examinees marked some items largely on the basis of misinformation. 

If examinees who are not able to select the correct answer to an item 
were to guess blindly among all the choices, the incorrect choices would 
be chosen by insignificantly different proportions of examinees. In this 
case, the proportion of examinees who marked the correct answer by 
chance alone would presumably be closely approximated by the average 
of the proportions marking the incorrect choices. Anyone who has examined 
item analysis data, however, knows that only rarely are the incorrect 
choices in an item found to be equally attractive to the examinees. Pre- 
sumably, this indicates that examinees who do not know the correct 
answer are responding not merely on the basis of blind guessing but, 
to some extent, on the basis of partial information and misinformation. 
Unfortunately, there is no straightforward way of estimating the propor- 
tion of the examinees who marked the correct answer by chance alone 
when (as is usually the case) misinformation as well as partial information 
and blind guessing plays a part in determining the examinees' responses. 
If we ignore the effect of misinformation, however, such an estimate can 
be made. Say, for example, that for a five-choice item 20 percent of the 
examinees are able to select the correct answer (choice B) on the basis of 
the information they possess and that 10 percent are so nearly uninformed 
that they can respond only on the basis of a blind guess (or resort to 
irrelevant non-chance factors uncorrelated with the function measured 
by the item). In this situation they will distribute their marks equally 
among the five choices in accordance with chance expectancy. Another 
20 percent of the examinees are sufficiently informed to be able to eliminate 
choice E but cannot discriminate among the other choices; so they distribute 
their responses equally among choices A, B, C, and D. Another 18 percent 
are able to eliminate choices A and E, and they distribute their responses 
evenly over choices B, C, and D. The remaining 32 percent are able to elimi- 
nate choices A, E, and C but do not possess enough information to discrimi- 


nate between choices B and D. Hence, they split their responses between 
these two choices. The item analysis will then show the following distribu- 
tion of responses: 


Choice A 7 

Choice B 49 

Choice C 13 

Choice D 29 

Choice E 2 

Total 100 

Now what part of the 49 percent who marked the item correctly did 
so by chance? We postulated that only 20 percent knew the answer; there- 
fore, 29 percent must have marked choice B by chance alone. Note that 
this is exactly the percentage that marked the most popular incorrect 
choice. That these two percentages are identical is not a coincidence. Horst 
has shown mathematically that in the case of a multiple-choice item where 
the correct answer is the most popular choice and the choices constitute a 
graded series of steps of knowledge about the point being tested, the percent 
of the examinees selecting the most popular incorrect choice is equal to 
the percent marking the correct answer by pure chance. By this restriction 
Horst rules out misinformation (18). 
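The figures in this example can be reconstructed directly. The sketch below spreads each group's responses evenly over the choices it cannot eliminate and confirms, for these figures, Horst's equality between the chance successes on the correct answer and the most popular incorrect choice:

```python
from collections import Counter

# Percent of examinees in each knowledge state, and the choices that
# remain open to that group (the figures of the worked example above).
groups = [
    (20, "B"),        # know the answer outright
    (10, "ABCDE"),    # guess blindly among all five choices
    (20, "ABCD"),     # can eliminate choice E only
    (18, "BCD"),      # can eliminate choices A and E
    (32, "BD"),       # can eliminate choices A, E, and C
]

dist = Counter()
for pct, open_choices in groups:
    for c in open_choices:
        dist[c] += pct / len(open_choices)

print({c: round(dist[c]) for c in "ABCDE"})
chance_successes = dist["B"] - 20            # all of B except the 20 who knew it
most_popular_wrong = max(dist[c] for c in "ACDE")
print(chance_successes, most_popular_wrong)
```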

The point of any correction for chance success is to subtract from the 
percent marking an item correctly the percent attributable merely to chance 
or to non-chance factors having zero relationship to the function measured.^ 
Yet we have shown that the percent to be subtracted cannot be estimated 
precisely under the conditions that most often obtain. 

The conventional correction for chance success leads to the following 
formula for computing the percent of successes on a given test item that is 
read by every examinee: 

Corrected percent = Rp - Wp/(k - 1), 

where Rp = the percent who answer correctly, 
Wp = the percent who answer incorrectly, 
k = the number of choices in the item. 

When this formula is applied to the data for the item given above, we 
obtain for the corrected percentage: 

49 - 51/4 = 49 - 12¾ = 36¼. 
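The computation is easily expressed as a function (a direct transcription of the conventional formula):

```python
def corrected_percent(right_pct, wrong_pct, k):
    """Conventional correction for chance success: Rp - Wp/(k - 1),
    where k is the number of choices in the item."""
    return right_pct - wrong_pct / (k - 1)

# The five-choice item above: 49 percent right, 51 percent wrong.
print(corrected_percent(49, 51, 5))
```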


^ Our discussion of correction for chance success does not consider the problem of 
assigning optimal weight to error scores when both "rights" and "wrongs" are entered 
into a multiple regression equation yielding scores maximally correlated with a set of 
criterion scores. 


Note that we subtract only 12¾ percent though 29 percent did select 
choice B by chance alone. We have greatly undercorrected for chance 
success by using the conventional correction procedure. This will 
always be the case when misinformation has played no part in determining 
the examinees' responses to the item and when the incorrect choices are 
unequally attractive because of the operation of partial information. 

In actual practice, misinformation also operates to determine in some 
degree the examinees' responses. As will be apparent on a moment's 
reflection, the presence of misinformation causes the conventional procedure 
for correcting for chance success to overcorrect. Hence, the two factors 
other than chance that operate to determine the responses of examinees 
who cannot identify the correct answer tend to counterbalance each other 
in their effects on the conventional correction for chance success. Whether 
partial information or misinformation is the more important in determining 
the examinees' responses varies from item to item and from group to group 
of examinees. 

In view of the lack of analytic precision in the conventional correction 
for chance success, it is not surprising that there is some disagreement 
among test constructors about whether to make use of it in computing item 
analysis data. Test technicians who oppose its use for this purpose contend 
that blind guessing on the part of the examinees may be reduced to 
negligible proportions by: 

1. warning pupils against it, 

2. constructing items in such a way as to include in their distracters 
bits of misinformation or of partial information about the subject 
matter that are known to be in common circulation among the pupils, and 

3. allowing sufficient time for every pupil to consider all the items 
being tried out. 

Whether this contention is correct cannot be completely proved or dis- 
proved, but it seems reasonable that these steps would tend to reduce the 
amount of blind guessing. Let us consider the practical implications of 
these three suggestions. 

Warning pupils against blind guessing can be accomplished in several 
ways. For example, the directions to examinees might state, "You may 
answer questions even when you are not sure that your answers are correct, 
but you should avoid wild guessing." This statement affects examinees in 
different ways. Those who try to follow the admonition must decide how 
little confidence they can have in an answer before it becomes a wild 
guess. Naturally, the more conscientious and timid examinees will omit 



items more often than will others. Some examinees will deliberately answer 
all items if they think that the scoring system provides no larger penalty 
for guessing wrong than for omitting an item. Thus, the bolder, more 
sophisticated, and less conscientious examinees may enjoy an advantage 
deriving from their personality characteristics and not at all from their 
understanding of the subject matter being tested. 

Many test constructors and teachers object to this possible outcome and 
urge that a penalty for errors be included in the scoring system so that 
examinees who guess freely or who callously ignore the directions will 
not enjoy this advantage over those who omit items that they feel they can- 
not answer with at least a shrewd guess. To provide this penalty, the 
conventional correction for chance success may be employed, and the 
directions may be modified to read, "You may answer questions even 
when you are not sure that your answers are correct, but you should avoid 
wild guessing. Your score will be the number of items you mark correctly 
minus a fraction of the number you mark incorrectly." 

Occasionally it has been suggested that these directions be used even 
when no penalty for errors is employed; more often, it has been suggested 
that the directions imply a penalty for errors when none is to be used. 
Experienced test constructors reject these suggestions because the effect on 
pupils and on subsequent test administration is serious if the ruse is 
discovered, as it almost invariably will be. 

Constructing items in such a way that the distracters will be plausible is 
universally regarded as desirable, because it tends to prevent examinees 
with only partial information from eliminating certain choices and thus 
improving their chances of guessing correctly and because conscientious 
examinees who lack sufficient knowledge to select the correct answer may 
find distracters that suggest bits of misinformation in their possession and 
mark those distracters. In addition, the test constructor should make the 
correct answers unlike stereotyped responses that some examinees may 
have memorized without genuine understanding. How successful item 
writers are in attaining these goals depends on a number of factors, such 
as their skill and the nature of the subject matter. To the extent that 
plausibility in distracters is obtained by incorporating popular misinforma- 
tion into them, blind guessing tends to be reduced; to the extent that 
plausibility in distracters is obtained by making them only slightly incor- 
rect, the effectiveness but not the amount of blind guessing tends to be 
reduced. That is, if all the distracters in an item are so plausible that 
an examinee's partial information is not adequate to rule out any of them, 
he will have to guess among all of the choices or omit the item. Thus, as 


the plausibility of the distracters is increased by making them only slightly 
incorrect, the conventional correction for chance success becomes more 
nearly applicable, whereas if the plausibility of distracters is increased by 
incorporating misinformation into them, the conventional correction for 
chance success becomes less applicable. 

To allow sufficient time for every examinee to consider every item in a 
tryout test is a worthy objective but one that is harder to attain, in actual 
practice, than one might at first suppose. Let us consider why it is ordinarily 
so desirable for every pupil to have time to consider every item. First, the 
reliability of the item analysis data pertaining to any given item is a func- 
tion of the number of examinees who mark it; second, the functions meas- 
ured by items for which it is appropriate to obtain individual item analysis 
data are not ordinarily speeded functions but are power functions; third, 
the behavior of examinees who know they are faced with more items than 
they can seriously consider within the time limit varies considerably. The 
bolder, less conscientious, and more sophisticated of them are apt to spend 
the last few minutes of the time allotted in marking answers to all the 
items they have not had time to consider. This is most undesirable because 
it introduces chance variance inextricably into the scores and because there 
is no entirely satisfactory way of correcting for chance success to prevent 
examinees who do this from benefiting from it. 

If no penalty for marking items incorrectly is imposed, it is obvious that 
examinees who rush through all the items they do not have time to con- 
sider, marking almost at random, will have an advantage over those who 
follow directions and do not mark items they have not read. Theoretically, 
the best way to prevent this is to provide enough time for every pupil to 
consider every item. It has sometimes been suggested that no time limit 
be set and that pupils be allowed to leave as soon as they have tried every 
item. The difficulty is that, when this procedure is followed, some of the 
pupils will work hastily and carelessly, marking their answers almost at 
random in order to get away at the earliest moment. This soon becomes 
evident to other pupils and tends to lower the morale of the group. The 
confusion caused as pupils leave individually or in small groups is also 
troublesome; even well-intentioned pupils produce noise and distraction 
if they leave while others are working. 

If a tryout test is administered with no time limit, all of the pupils being 
held until everyone has tried every item, behavior disorders are apt to occur 
even in well-managed schools where the morale is ordinarily good and the 
pupils cooperative. Experience shows that typical tryout tests of thirty to 
forty minutes' duration should contain enough items to permit about 90 


percent of the examinees to consider every one. The time limit should be 
stated in advance, and no pupil should be allowed to leave before the end 
of the testing period except for unavoidable reasons. If a suitable time of 
day is chosen, this practice will ordinarily provide about as complete data 
as can be obtained without noticeable deterioration of pupil morale. With 
shorter tryout tests, the time limits can usually be set to permit more than 
90 percent of the pupils to consider every item without exceeding the limits 
of patience for some pupils. 

In these circumstances, the test constructor has to decide whether it is 
more unsatisfactory to use a correction for chance success or to allow the 
possibility that some of the examinees will disobey instructions and thereby 
gain an undeserved advantage. This decision is influenced by the probable 
appropriateness of the correction for chance success, the proportion of 
examinees who have probably marked items they have not seriously con- 
sidered, and the extent to which these pupils will recognize that they have 
been able to beat the game. 

Even if enough time is provided so that every examinee is able to con- 
sider every item, the proportion of items left unmarked varies from pupil 
to pupil. This variation is the result partly of differences in the amounts 
of knowledge possessed by the pupils and partly of differences in their 
personalities. To eliminate these personality factors, it has been proposed 
that pupils be required to mark an answer to every item in a tryout test. 
Directions to accomplish this may read, "Before handing in your test book- 
let and answer sheet, be sure to mark an answer to every question, even if 
you have not had time to consider it or if you feel completely unable to 
answer it correctly." 

Even if these directions are employed, pupils will not usually follow 
them unless their answer sheets are inspected before acceptance and they 
are forced to mark an answer to every item. At one time, the writer ad- 
ministered a test of reading comprehension to 541 teachers-college fresh- 
men with directions to take all the time needed and to mark every item 
even if this meant guessing on some items. The students worked on the 
items in good spirit and apparently cooperated well, but the answer sheets 
were not inspected before acceptance and it was later found that only 421 
students had actually marked an answer to every item. To force students 
to mark answers to items based on reading passages available for reference 
is undesirable, but to force them to mark answers to items testing specific 
subject matter that they do not know and cannot figure out is far worse. 
It is not only frustrating to the students but it goes contrary to good teaching 
practices and compels the students to break habits of carefulness that the 


schools try hard to inculcate. Once in a while it is possible that students 
might be told that as part of an experiment they are required to mark an 
answer to every item in a test even when they have no idea what to mark, 
but in systematic testing programs this would be inadvisable as well as 
impractical. It might eliminate variations in the number of omissions and 
thus wipe out some of the effects of differences in personality, but it would 
do this at a cost of antagonizing teachers and frustrating students. It would 
also introduce additional chance variance into the scores. 

Correction for chance success is sometimes recommended partly on the 
ground that the percentages of success on various items will be comparable 
even if the items include different numbers of choices. The extent to which 
they actually will be comparable after correction for chance depends on the 
appropriateness of the basic assumption underlying correction for chance 
success. If the examinees answer the items luithout resorting to blind 
guessing or to non-chance factors uncorrected with the function to be 
measured, the percentages of success will be exactly comparable even if the 
items include diflferent numbers of distracters. The fact that the percentages 
of correct responses are usually found to increase as the number of dis- 
tracters is decreased may be explained on the ground that it actually is more 
difficult to decide which one of several choices is the keyed answer as the 
number of distracters is increased; that is, for a given combination of item 
stem and correct response, as equally attractive distracters are added, the 
percentage of correct responses in comparable samples of examinees will 
decrease. In other words, for items of this special type, a five-choice situa- 
tion is more difficult than a four-choice situation. It is doubtful whether 
this special type of item is encountered in actual practice any more often 
than the type for which misinformation and partial information play no 
part in determining the examinees' responses. 

It should be pointed out that the corrected percentage of a sample that 
answers an item correctly must not be regarded as the exact percentage 
of the examinees that actually possesses enough information to mark the 
item correctly; it is, of course, only a fallible estimate of that percentage, 
which may be subject to systematic bias. Still less justifiably can the cor- 
rected percentage of success for an item be regarded as the percentage of 
a sample that "knows" the correct answer out of the context provided by 
the particular distracters used with it in a specific item. This is especially 
true of items of the best-answer type; such items are clearly intended to 
determine how many examinees are able to distinguish the choice keyed 
as correct from the particular distracters provided. The difficulty of making 
the distinction is a function of the number of distracters and of their 


attractiveness as well as of the choice keyed as correct. In any multiple- 
choice item, regardless of whether it is of the best-answer type, the likeli- 
hood that an examinee will mark the correct answer depends on how 
successfully he can eliminate the answers keyed "wrong" as well as on 
how successfully he can recognize the choice keyed as "correct" or "best." 
This does not mean that guessing may not take place. An examinee may 
exclude one or more choices from consideration and then guess among 
those remaining; in fact, it works to his advantage to do so, whether the 
conventional correction for chance success is employed or not. 
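The arithmetic behind this advantage is easy to verify. Under the conventional corrected score S = R - W/(k - 1), a blind guess among all k choices has an expected value of zero, while a guess made after eliminating even one distracter has a positive expectation. A minimal sketch, in which the values of k and m are purely illustrative:

```python
# Expected item score under the correction-for-chance scoring rule
# S = R - W/(k - 1) when an examinee eliminates some distracters and
# guesses at random among the m choices that remain.

def expected_score(k, m):
    """Expected corrected score for a random guess among m remaining
    choices of a k-choice item (exactly one of the m is correct)."""
    p_correct = 1.0 / m
    return p_correct * 1.0 + (1.0 - p_correct) * (-1.0 / (k - 1))

k = 5
for m in range(1, k + 1):
    print(f"choices remaining: {m}, expected score: {expected_score(k, m):+.3f}")
```

With all five choices in play the expectation is zero; each distracter eliminated raises it, which is why partial knowledge pays whether or not the correction is used.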

Summary of correcting versus not correcting for chance. — After reading 
the preceding discussion, the reader may well be looking for some sum- 
mary that will serve as a guide to action. Let us bring together some of the 
main points and state their principal implications. First, it is agreed that 
items should be so written as to make their distracters as attractive as 
possible. Second, it is agreed that tryout tests should be given under as 
nearly unspeeded conditions as practical. Third, it is agreed that examinees 
should be warned not to guess blindly. Fourth, there is some disagreement 
regarding the extent to which blind guessing (or reliance on irrelevant 
non-chance factors) occurs even when it has been minimized, regarding 
the practicality of providing time for every examinee to try every item, 
and regarding the importance of preventing any examinee from thinking 
he has been able to beat the game. 

Naturally, if a test technician believes that chance responses to a test 
are nonexistent, he will not use a correction for chance success. However, 
no one has seriously contended that such is the case; instead, some authori- 
ties believe that under attainable conditions the amount of blind guessing 
may be reduced to negligible proportions. Apparently these authorities 
are not seriously bothered by the fact that a few examinees may deliberately 
ignore the directions and may knowingly gain an advantage over others. 
Therefore, they do not recommend a correction for chance success. 

Other authorities are willing to undertake the slight additional labor 
involved in correction for chance success even when the amount of blind 
guessing is of small proportions because they feel that examinees who 
ignore directions should realize that they may suffer a penalty for doing 
so. These authorities believe that the attitude of students, teachers, and 
laymen toward the use of tests will be better under these circumstances and 
that, in the long run, less unnecessary chance variance will be introduced 
into test scores as examinees generally come to accept the fact that no 
automatic premium will accompany blind guessing. It is quite true that 
correction cannot remove the effects of chance variance already present in 


test scores, but its use can discourage the practice of blind guessing that 
leads to the introduction of unnecessary chance variance. 

The writer has found that considerations of expense and time do not 
ordinarily permit tryout tests to be given in such a way that every pupil 
can try every item. In this situation the use of a correction for chance 
success tends to prevent examinees who guess blindly or mark items they 
have not had time to consider from profiting thereby. If most of the items 
are marked largely on the basis of knowledge and partial knowledge plus 
some guessing (as is probably often the case), the correction for chance 
success will tend to be too small even though the presence of misinforma- 
tion will have some effect in the opposite direction. 

In the writer's opinion, it is especially important to make use of a 
correction for chance success in obtaining individual raw scores that are 
to be used for internal-consistency item analysis purposes. Data obtained 
by Bryan, Burke, and Stewart in studies made at the Cooperative Test 
Service of the American Council on Education point to the conclusion 
that internal-consistency item analysis data for tests on which not all 
examinees finished are more meaningful when the high-scoring and low- 
scoring groups have been selected on the basis of total scores that were 
corrected for chance success.^ It is probably less consequential whether 

^ Because correction for chance success often makes test scores easier to interpret, the 
writer is inclined to urge its use for almost all educational purposes. Recently the writer 
tested a group of 393 high school pupils with Form B of the Nelson-Denny Reading Test. 
The 47 pupils who obtained the lowest scores were then grouped together and told that 
they might have done better if they had marked an answer for every item whether they knew 
the answer or not. This is, of course, true since the Nelson-Denny test is rather highly 
speeded and is not scored with a correction for chance success. Form A of the test was then 
administered. The average score of the 47 pupils on the first test (Form B) was 25.53; 
on the second test (Form A) it was 46.32. According to the published norms, this differ- 
ence amounted to a gain of 2.7 grades for the group. Yet, when scores on both forms 
were corrected for chance, the average gain was only 2 raw-score points (though the 
standard deviations of the distributions of scores were larger). This gain, attributable 
largely to regression to the population mean, is obviously a more meaningful and more 
readily interpreted indication of the real change that took place in the reading ability of 
the pupils between testings than is the gain represented by the difference in raw scores 
uncorrected for chance success. This is a rather unusual example but illustrates the point. 

The writer discounts statements that the correlations between sets of test scores obtained 
with and without correction for chance are exceedingly high. In the first place, the correla- 
tions cited are usually spuriously high because they are obtained by scoring the same set 
of test papers in two ways. In this situation, the directions for administering the tests are 
the same and thus can be appropriate only to one of the two scoring procedures. It would 
be desirable to obtain the correlation between two comparable forms of the same test 
administered successively to the same sample with two sets of directions — one appropriate 
to scoring without correction for chance success and one to scoring with it. This correla- 
tion should be compared with a parallel-forms reliability coefficient for the same test based 
on the same sample. An exact test of the significance of the difference between the two 
coefficients would permit inferences to be drawn regarding the point at issue. 


the percentages used in obtaining item difficulty and discrimination indices 
are corrected for chance success than whether the scores of the examinees 
on the tryout test are so corrected. There is little doubt that the selection 
of items for the final form of a test will be almost identical regardless of 
whether the item difficulty and discrimination indices have been corrected 
for chance success provided that proper allowance is made for the differ- 
ence in level of item difficulty. However, there is reason to believe that 
the discrimination indices are made more nearly independent of the level 
of difficulty of the items when correction for chance success is employed in 
computing these indices (42, pp. 389-90). This is a point in favor of 
using the correction, though it is partially offset by the fact that the reli- 
ability of such indices may be a trifle lower when the correction has been 
made than when it has not. This change no doubt occurs because the in- 
fluence of the more reliable difficulty indices is removed from the discrimi- 
nation indices. The writer values the gain in meaningfulness of the dis- 
crimination indices so much more than the slight amount of reliability 
lost that he is inclined to favor correcting the item analysis data for chance 
success as well as the examinees' scores on the tryout test. 

Items Omitted and Not Reached 
As we have already noted, an item may not be marked for two quite 
different reasons: first, because the pupil reads it, realizes he does not know 
the answer and cannot figure it out, and deliberately omits it as he may 
have been instructed to do; second, because the pupil does not have time 
to consider the item. Unless provision is made for taking into account 
these two different reasons, valuable information about the items is irretriev- 
ably lost and the resulting data become difficult to interpret. To avoid this, 
a simple empirical check may be made. This check is based on the assump- 
tion that the items have been marked in sequence and consists in going 
through the answer sheets or test papers, noting which item on each sheet 
is the last one marked. As this is noted, a mark is made for each answer 
sheet on a separate tally sheet opposite the number of the next item (after 
the last one marked). This next item is presumed to be the first item not 
read by the examinee. If the criterion groups for use in the item analysis 
have already been selected, this check is made separately for the papers in 
each group. After all the papers to be used for item analysis have been 
examined, a cumulative frequency table is made showing the number of 
examinees who have not read each item; that is, have not reached it in the 
time limit. For the last item in a test (or in each separately timed subtest), 
there is no way to make a distinction between examinees who have read 
the item and deliberately refrained from marking an answer to it and 


examinees who have not quite reached it in the time limit. If, in this 
dilemma, all of the examinees who did not mark an answer to the last 
item are considered to have lacked time to read it, the percent of correct 
responses computed for the item is likely to be spuriously high. On the 
other hand, if all of the examinees who did not mark an answer to the last 
item are considered to have read it and deliberately refrained from marking 
an answer to it, the percent computed is likely to be spuriously low. As a 
practical compromise, the number of examinees who did not reach the next- 
to-last item in the time limit may be taken as the number that did not reach 
the last item. This procedure provides serviceable data for the last item 
with the expenditure of no additional labor. 
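The tally described above can be sketched in a few lines of code. This is an illustrative reconstruction rather than a procedure given in the text, and the answer-sheet representation (a list of responses per examinee, with None for an unmarked item) is an assumption of the sketch:

```python
# For each answer sheet, find the last item marked; the next item is
# presumed to be the first one the examinee did not read. The counts
# returned give, for each item, how many examinees did not reach it.

def not_reached_counts(answer_sheets, n_items):
    """Return, for each item, the number of examinees presumed not to
    have reached it in the time limit."""
    first_not_read = []
    for sheet in answer_sheets:
        marked = [i for i, resp in enumerate(sheet) if resp is not None]
        last_marked = marked[-1] if marked else -1
        first_not_read.append(last_marked + 1)  # item after the last one marked
    # item j counts as "not reached" for everyone whose first unread
    # item is j or earlier
    return [sum(1 for f in first_not_read if f <= j) for j in range(n_items)]

sheets = [
    ['A', 'C', None, 'B', None, None],   # item 2 omitted deliberately
    ['B', 'B', 'A', 'D', 'C', None],
    ['A', None, 'C', None, None, None],
]
print(not_reached_counts(sheets, 6))
```

Note that an unmarked item lying before the last-marked item (as in the first sheet) is treated as a deliberate omission, exactly as the empirical check above assumes.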

In effect, then, the percent answering an item correctly should be based 
on the number of examinees who marked the item plus the number of 
examinees who read it but decided not to mark it. In practice, this procedure 
prevents items that are read but not marked by quite a few examinees from 
appearing to be easier than they really are. It is apparent that the percent 
of a sample of examinees who decide to omit items depends in large meas- 
ure on the directions for the test. For reasons explained previously, direc- 
tions for most tests should discourage sheer guessing; hence, the examinees 
are apt to omit items in order to avoid a penalty for guessing incorrectly. 

If, in the computation of the percent of examinees that answer an item 
correctly, we include the number who did not have a chance even to read 
the item in the time limit, we are in effect acting on the assumption that 
the examinees who did not reach an item would mark it correctly only as 
often as chance would permit. This assumption seems unreasonable to the 
writer, who would prefer to work on the assumption that the examinees 
who did not have time to read an item would have answered it correctly 
in about the same proportion as those who did read it.* This assumption 
seems more reasonable than any other that has practical utility, even though 
it is probable that the distribution of responses of examinees who did not 
read the item would (if they were given additional time) tend to be more 
like a chance distribution than like the distribution of responses made by 
those who did read the item (especially if the items have been arranged 
in order of difficulty) . This expectation is based on the positive correlations 
usually found between proficiency (or level) scores and speed scores in any 
mental trait. 

* Note that, if the criterion groups have already been identified, the procedure described 
in the preceding paragraphs is applied to each group separately. Thus, the assumption is 
that the examinees in a defined group who did not have time to reach an item in the time 
limit would have answered it correctly in about the same proportion as the examinees 
in the defined group who did reach it. 


Formulas for Computing Percent of Correct Responses 
Formulas (1), (2), and (3) are recommended for computing the per- 
cent of correct responses to an item of the types specified, provided that a 
correction for chance success is to be used. If this correction is not to be 
used, these formulas can be modified by deleting the fraction of the 
"wrongs" that appears in the numerator of each formula. This modifica- 
tion eliminates the correction for chance success but retains the adjust- 
ment for items not reached. 


Pt = 100 [Rt - Wt/(ki - 1)] / (Nt - NRt) ,                        (1) 

where Pt = the percent of correct responses in the entire sample, ad- 
justed for chance success and for omissions caused by not 
reaching the item in the time limit, 
Rt = the number of examinees in the entire sample who answer 
the item correctly, 
Wt = the number of examinees in the entire sample who answer 
the item incorrectly, 
ki = the number of choices in the item, 
Nt = the number of examinees in the entire sample, 
NRt = the number of examinees in the entire sample who do not 
reach the item in the time limit. 

If every examinee has reached every item in the time limit, NRt becomes 
zero and computation of the adjusted percents is simplified accordingly. If 
adjustment for chance success is not required, the subtraction term in the 
numerator of formula (1) is omitted, again simplifying the computation. 
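As an arithmetic check, formula (1) can be written as a small function. The counts used below are hypothetical, chosen only to show the effect of the adjustment:

```python
# Formula (1): percent correct adjusted for chance success and for
# examinees who did not reach the item. Variable names follow the text.

def p_corrected(R, W, k, N, NR, correct_for_chance=True):
    """Adjusted percent correct; set correct_for_chance=False to keep
    only the adjustment for items not reached."""
    rights = R - (W / (k - 1) if correct_for_chance else 0.0)
    return 100.0 * rights / (N - NR)

# Hypothetical tryout data: 300 examinees, 40 did not reach the item;
# of the 260 who did, 180 answered right, 60 wrong, 20 omitted it; the
# item has 4 choices.
print(round(p_corrected(R=180, W=60, k=4, N=300, NR=40), 2))
print(round(p_corrected(R=180, W=60, k=4, N=300, NR=40,
                        correct_for_chance=False), 2))
```

The chance correction lowers the figure from about 69 percent to about 62 percent, reflecting the assumption that some of the 180 rights were lucky guesses.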

When matching exercises are used, they should be constructed in such a 
way that the directions may properly specify that one of the terms in a 
series may be the correct answer to more than one term in the other series. 
To compute the percent of correct responses for each item in a matching 
exercise when the number of terms in each series in the exercise is the same, 
when more than one term in a series can be correctly matched to a single 
term in the other series, when there is no restriction on the order in which 
the examinees may mark responses to the items in each exercise, and when 
the same number of examinees has read every item in the exercise, the fol- 
lowing analogue of formula (1) is recommended: 




Pt = 100 [Rt - Wt/(ks - 1)] / (Nt - NRt) ,                        (2) 

where ks = the number of terms in each series in the matching exercise. 
A matching exercise is defined as a set of related items. When formula 
(2) is used, it is required only that every examinee shall have marked every 
item in a given exercise. It is not necessary that the same number of ex- 
aminees mark the items in different exercises. 

A practical example to show how formula (2) is applied may be of in- 
terest. Suppose the following matching exercise has been given to 100 ex- 
aminees and that the numbers of correct and incorrect responses to each 
item are as shown: 

1. Boston           I. Pennsylvania 
2. Pittsburgh      II. Massachusetts 
3. Philadelphia   III. Connecticut 

Item              Correct   Incorrect 
1. Boston            80        20 
2. Pittsburgh        75        25 
3. Philadelphia      80        20 
The corrected percent for item 1 in this matching exercise is found by sub- 
stituting in formula (2): 

Pt = 100 [80 - 20/2] / 100 = 70. 

This result indicates that approximately 70 percent of the examinees 
actually know that Boston is in Massachusetts or that it is not in Pennsylvania 
or Connecticut, although 80 percent of them marked the item correctly. 
For Pittsburgh the corrected percent is 62.5, and for Philadelphia it is 70. 
Like the correction for chance used with multiple-choice items, this pro- 
cedure is a practical compromise that does not yield corrections that are 
strictly unbiased in each individual instance. 
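The computation for all three items can be checked with a short function implementing formula (2). The Boston tally (80 correct of 100) is stated above; the Pittsburgh and Philadelphia tallies are inferred from the corrected percents quoted (62.5 and 70), on the assumption that all 100 examinees marked every item:

```python
# Formula (2): corrected percent for one item of a matching exercise
# whose two series each contain ks terms.

def p_matching(R, W, ks, N, NR=0):
    return 100.0 * (R - W / (ks - 1)) / (N - NR)

for city, R in [("Boston", 80), ("Pittsburgh", 75), ("Philadelphia", 80)]:
    print(city, p_matching(R=R, W=100 - R, ks=3, N=100))
```

The function reproduces the corrected percents of 70, 62.5, and 70 given in the text.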

When a matching exercise has a different number of terms in its two 
series, the same basic principle is applied in correcting for chance. Again, 
on the assumptions that the same number of examinees have marked every 
item in a given exercise, that more than one term in the shorter series can 
be correctly matched to a single term in the longer series, and that no restric- 
tion is placed on the order in which the examinees may mark responses to 
the items in each exercise, the following formula is suggested: 

Pt = 100 [Rt - Wt/(kl - 1)] / (Nt - NRt) ,                        (3) 

where kl = the number of terms in the longer series.^ 


^ Note that formula (3) is explicitly ruled out when the terms in the longer series are 
to be matched to terms in the shorter series. For example, the situation in which many 


The percents obtained by means of formulas (1), (2), and (3) may 
be used as item difficulty indices with considerable success. They can be 
regarded as essentially comparable regardless of the number of choices in 
each multiple-choice item or the number of terms in each series in a match- 
ing exercise. Needless to say, percents that are used as difficulty indices 
without correcting for chance and without adjusting for the failure of some 
examinees to read some items will be comparable only if no blind guessing 
occurs and if every examinee has marked every item. 

A disadvantage inherent in the use of percents as difficulty indices is the 
fact that percents from 1 to 99 do not even approach an interval scale of 
difficulty. An interval scale (contrasted with a nominal or ordinal scale) 
possesses the properties of both order and equality in its units. It is well 
known that (in the case of traits that are not distributed rectangularly) a 
given difference between two successive percents is indicative of a greater 
actual difference in the underlying trait being measured as the percents 
being compared are moved away from 50. For this reason, when percents 
are used as difficulty indices, constants cannot properly be added to or sub- 
tracted from them; neither can they legitimately be averaged. 

Yet, on many occasions it is convenient to express the difficulty indices 
of items tried out in two different samples on a single scale of difficulty or 
to compare the average difficulty of two sets of items tried out on the same 
sample. To make possible such manipulations of indices on a reasonably 
legitimate basis, the percents obtained by using formulas (1), (2), and 
(3) may be transformed into standard-deviation units. The writer has pro- 
posed that for maximum convenience to test editors this transformation be 
so made that the range of item difficulty indices will be 1-99 (7, pp. 7-8). 
If the population distribution of the variable being measured is assumed to 
be normal, this can be accomplished by using the Kelley-Wood Table of the 
Normal Probability Integral (102, Table 1) to transform the percents 
yielded by formulas (1), (2), and (3) into standard-deviation units, by 
multiplying each of the latter units by a constant (21.063 to five significant 
figures), and adding 50, algebraically, to each product. When this procedure 
is followed, percents of 1, 50, and 99 obtained from formulas (1), (2), 
and (3) will remain 1, 50, and 99, respectively, but almost every other 
percent so transformed will take a different value. This transformation re- 
tains the advantage of 50 as the middle of the scale and of 1 and 99 as 

terms are to be placed in one of two categories (which comprise the shorter series) is a 
two-choice item and should be so corrected for chance success. 


nearly extreme values while it places all intermediate indices on an ap- 
proximation to an interval scale with respect to difficulty in the trait meas- 
ured. For additional convenience, percents of and 100 (for which the 
corresponding standard-deviation units are, of course, infinity) are regarded 
as and 100, respectively, on the new scale." To express the transformed 
difficulty indices for one set of items tried out in two different samples on 
a single scale with sufficient accuracy for most practical purposes, it is neces- 
sary only to average the transformed indices for each sample and to subtract 
the difference between the means from the transformed difficulty index of 
each item in the sample having the higher mean. To express the trans- 
formed indices for two different sets of items tried out on different samples, 
an identical group of representative items may be inserted in each set. The 
difference between the averages of the transformed indices for this group 
of items in each sample may then be obtained and used as a constant for 
subtraction from the transformed index for each item in the set accompanied 
by the higher mean for the group of identical items. To compare the average 
difficulty of two different sets of items tried out on the same sample, the 
transformed indices may be averaged and the difference noted. 
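The transformation itself is easy to sketch, assuming (as the text does) a normal distribution of the underlying trait. Here the inverse normal function of the Python standard library stands in for the Kelley-Wood table:

```python
from statistics import NormalDist

# Percent correct -> normal deviate -> difficulty index on the 1-99
# scale: index = 21.063 * z + 50, with 0 and 100 kept by convention.

def difficulty_index(percent):
    if percent <= 0:
        return 0.0    # conventional value for 0 percent
    if percent >= 100:
        return 100.0  # conventional value for 100 percent
    z = NormalDist().inv_cdf(percent / 100.0)
    return 21.063 * z + 50.0

for p in (1, 25, 50, 75, 99):
    print(p, round(difficulty_index(p), 1))
```

Percents of 1, 50, and 99 remain essentially unchanged, while intermediate percents are pulled toward 50, spacing the indices on an approximate interval scale so that they may legitimately be averaged and shifted by constants.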

Experimental evidence has shown that difficulty indices of the sort de- 
scribed are extremely reliable when they are based on samples as large as 
400. The convenience of using the same percents for obtaining both diffi- 
culty indices and discrimination indices has led to the computation of diffi- 
culty indices that are only estimates of the percents in the entire sample. 
These estimates are sometimes based on data obtained from only the high- 
est and lowest 27 percent of the sample tested and rest on the assumption 
that the regression of individual item scores on the scores used for selection 
of the highest and lowest 27 percent of the sample is rectilinear. This 
assumption is not unreasonable in most cases and makes possible consider- 
able economy of labor. It should, however, be subjected to experimental 

The writer has computed the reliability coefficient of a group of typical 
item difficulty indices estimated in this way and has found it to be .98 when 
the sample included 100 examinees in the highest 27 percent and 100 
examinees in the lowest 27 percent (42, pp. 385-90; 41; 47). These data 
suggest that the loss of reliability incurred by estimating indices from only 
54 percent of the sample tested is not sufficient to be of practical conse- 
quence when the two criterion groups employed include at least 100 ex- 
aminees apiece. 

^ A table for transforming percents into these indices is provided by Davis (7, Table A). 


In actual practice, then, formula (1) may be replaced by the following: 

PEst = 100 [ (Rh - Wh/(ki - 1))/(Nh - NRh) 
             + (Rl - Wl/(ki - 1))/(Nl - NRl) ] / 2 ,              (4) 

where PEst = the estimated percent of correct responses in the entire 
sample adjusted for chance success and for omissions 
caused by not reaching the item in the time limit, 
Rh = the number of examinees in the highest 27 percent of the 
sample who mark the item correctly, 
Rl = the number of examinees in the lowest 27 percent of the 
sample who mark the item correctly, 
Wh = the number of examinees in the highest 27 percent of the 
sample who mark the item incorrectly, 
Wl = the number of examinees in the lowest 27 percent of the 
sample who mark the item incorrectly, 
Nh = the number of examinees in the highest 27 percent of the 
sample, 
Nl = the number of examinees in the lowest 27 percent of the 
sample, 
NRh = the number of examinees in the highest 27 percent of the 
sample who do not reach the item in the time limit, 
NRl = the number of examinees in the lowest 27 percent of the 
sample who do not reach the item in the time limit, 
ki = the number of choices in the item. 

Formula (2) becomes: 

PEst = 100 [ (Rh - Wh/(ks - 1))/(Nh - NRh) 
             + (Rl - Wl/(ks - 1))/(Nl - NRl) ] / 2 ,              (5) 

and formula (3) becomes: 

PEst = 100 [ (Rh - Wh/(kl - 1))/(Nh - NRh) 
             + (Rl - Wl/(kl - 1))/(Nl - NRl) ] / 2 .              (6) 


Formulas (4), (5), and (6) may be modified if correction for chance 
success is not to be used simply by deleting the fraction of the "wrongs" 


in the high and low groups indicated for subtraction. This modification re- 
tains the adjustment for items not reached. 
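The estimate can be sketched as the mean of the adjusted percents in the two criterion groups, which is how the high-low formulas reduce under the rectilinear-regression assumption discussed below. All counts here are hypothetical:

```python
# Estimated whole-sample percent from the highest and lowest 27 percent
# criterion groups: the mean of the two group percents, each adjusted
# for chance success and for examinees who did not reach the item.

def p_adjusted(R, W, k, N, NR):
    return 100.0 * (R - W / (k - 1)) / (N - NR)

def p_estimated(high, low, k):
    """high and low are (R, W, N, NR) tuples for the two groups."""
    ph = p_adjusted(high[0], high[1], k, high[2], high[3])
    pl = p_adjusted(low[0], low[1], k, low[2], low[3])
    return (ph + pl) / 2.0

# Five-choice item, 100 examinees in each criterion group; 16 of the
# low group did not reach the item.
high = (85, 10, 100, 0)   # R, W, N, NR for the highest 27 percent
low = (40, 44, 100, 16)
print(round(p_estimated(high, low, k=5), 2))
```

Dropping the W/(k - 1) terms from p_adjusted gives the modification described above, which keeps the not-reached adjustment but omits the chance correction.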

The proposal has sometimes been made that item difficulty indices be 
transformed into scores having widely known characteristics. For example, 
the Cooperative Achievement Tests ordinarily yield Scaled Scores that 
possess properties with which many test users and test constructors are 
familiar. It would be possible to equate difficulty indices for the items in 
an experimental test to these Scaled Scores. This could be done most easily 
if the item were assumed to measure the same trait as one of the Cooperative 
tests for which Scaled Scores have been provided, but it could be accomp- 
lished whenever the correlation of scores (pass or fail) on the item with 
scores on such a test was known. Although some advantages would accrue 
if this proposal were carried out, they might not be great enough to war- 
rant the expenditure of time and labor required. 

Before discussing the use of difficulty indices in item selection, we shall 
consider the measurement of item discriminating power. 

Item Discriminating Power 

For many years test constructors have recognized the need for determin- 
ing the value of each test item for making the test in which it is included 
rank examinees accurately with respect to a defined criterion variable. If the 
test constructor knew the relative value of each item, he could select only 
the best for inclusion in the final form of the test. Unfortunately, there is 
no analytic solution to this problem that is even reasonably practical. 
Theoretically, if suitable criterion scores were available, product-moment 
correlation coefficients between them and the scores on each individual item 
could be computed. Likewise, the intercorrelations of the items could be 
obtained. Then a multiple regression weight for each item could be obtained 
and items having weights significantly different from zero at a specified 
level of confidence could be retained in the final form of the test. A require- 
ment inherent in this procedure, which renders it of theoretical interest only, 
is that the variances of all the principal components of the matrix of item 
intercorrelations would have to be proved significantly different from zero 
at an appropriate level of confidence. Obviously, this could not be done 
without using samples of fantastic size. Since regression coefficients for in- 
dividual items that are obtained on the basis of samples of ordinary size 
(say, up to 1,000 cases) are not meaningful, the approximation procedures 
that may be used by test constructors are probably more defensible as well 
as more practical. Later in this chapter we shall discuss these procedures. 
First, it may be best to consider specific methods of expressing the item- 
criterion relationship. 
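In modern terms, the theoretical regression procedure just described can be sketched in a few lines of code. The following is an illustrative Python sketch with fabricated tryout data; the function name and the data are ours, and numpy's least-squares routine stands in for the hand computation the text assumes:

```python
import numpy as np

def item_regression_weights(item_scores, criterion):
    """Least-squares regression weights for 0/1 item scores predicting
    a criterion -- the theoretical selection procedure described in the
    text, practical only for small illustrative item sets."""
    # Prepend an intercept column and solve by least squares.
    X = np.column_stack([np.ones(len(criterion)), item_scores])
    beta, *_ = np.linalg.lstsq(X, criterion, rcond=None)
    return beta[1:]  # drop the intercept

# Fabricated tryout data: two items related to ability, one pure noise.
rng = np.random.default_rng(0)
ability = rng.normal(size=2000)
items = np.column_stack([
    (ability + rng.normal(size=2000) > 0).astype(float),
    (ability + rng.normal(size=2000) > 0).astype(float),
    (rng.normal(size=2000) > 0).astype(float),
])
weights = item_regression_weights(items, ability)
# The two valid items receive substantial weights; the noise item does not.
```

With samples of ordinary size, however, the sampling errors of such weights are far too large for item selection, which is the point the text makes.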


The Criterion Variable 

Whenever it is practical, test constructors should obtain as nearly un- 
biased measures as possible of the ultimate criterion variable that is ap- 
propriate for the test being built. That these measures be highly reliable 
is much less important than that they be unbiased. For most achievement 
tests, however, it is difficult or impossible to obtain criterion measures. As 
pointed out in chapter 6 on "Planning the Objective Test," the content of an 
achievement test is often formulated by analyses of curriculums and text- 
books and by the pooled judgment of recognized authorities in the field. 
Under these circumstances a well-constructed test may constitute the best 
available measure of the criterion; in a sense, the test itself defines the 
function it is to measure. Such tests may be described as self-defining. 

The total scores derived from a self-defining test are, therefore, often 
used as the immediate criterion measures with which the individual items
in the test are correlated. These relationships between the total scores 
derived from a test and item scores are referred to as internal-consistency 
item discrimination indices. The relationships between item scores and 
scores in a criterion variable other than the total score on the test are re- 
ferred to as "item validity indices." The term "item discrimination indices," 
as used in this chapter, includes both internal-consistency item discrimina- 
tion indices and item validity indices. 

When the total score on a test is used as the criterion variable for judg- 
ing the discriminating power of each item in it, the resulting indices re- 
flect, among other influences, the extent to which the item measures the 
same mental functions as the total score. The fact that some items prove 
to have more discriminating ability than others means that for the group 
tested they are better measures of whatever the whole test actually meas- 
ures. Except when all of the items in a test measure the same mental 
functions with less than perfect reliability, internal-consistency item 
analysis data provide an inadequate basis for comparing the discriminating 
power of items for measuring whatever each individual item actually does 
measure. If two items of the same difficulty that have identical discriminat- 
ing power for measuring two different mental traits are tried out with 98 
homogeneous items, the one of the two that measures a trait more heavily 
weighted in the 98 items than the other will turn out to have the higher 
discrimination index. This is an obvious point but easily overlooked. It is 
an important point, too, because in practice we try out items in rather 
heterogeneous groups. Suppose, for example, that in a group of reading 
items of similar difficulty 90 call for finding detailed facts while 10 call 
for making inferences. If discrimination indices are computed for these 


items, using the total score on the 100 reading items as the criterion vari- 
able, the 10 inference items will be found to have markedly lower dis- 
crimination indices than the items that call for finding detailed facts. If a 
50-item reading test is to be constructed by selecting the items having the 
highest discrimination indices, no inference items are likely to be included. 
Yet experts in the field of reading would criticize the validity of such a 
test as a measure of reading ability. To avoid situations like this, it is de- 
sirable to assemble items into closely homogeneous groups for item anal- 
ysis purposes. For example, items that call for finding detailed facts and 
for making inferences should be grouped separately. This may sometimes 
be impracticable, as would be the case if only 10 items of a given type 
were available. 

Methods of Expressing Item Discriminating Power 

Many ingenious statistical procedures have been devised to provide dis- 
crimination indices. Some of these are so simple and require so little labor 
that teachers may profitably make use of them for ordinary classroom tests. 
Suppose, for example, that fifty pupils mark answers to each of fifty true- 
false items on material studied during the preceding week. Usually class- 
room tests of this kind are given with enough time for every pupil to try 
every item. The teacher may score the tests and put the test sheets in order 
by size of score. From the pile of papers he may take the top quarter 
(twelve to thirteen papers) and the bottom quarter and simply tally the 
number of correct answers to item 1 in each quarter. After this has been 
done for all fifty items, he can discard items with approximately the same 
number of tally marks in the two groups or with more tally marks in the 
bottom than in the top quarter. This procedure will tend to identify items 
with little internal-consistency discriminating power and lead to greater 
efficiency of measurement in a revision of the test composed of the more 
discriminating items. The fact that the procedure is crude and subject to 
a variety of errors should not deter teachers from making use of it. On the 
other hand, for professional work in test construction it is not adequate. 
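The teacher's short cut described above amounts to a simple tally routine. A minimal Python sketch follows (the function name and the 0/1 answer-sheet representation are ours):

```python
def quarter_tally(answer_sheets, n_items):
    """Classroom short cut: rank papers by total score, tally correct
    answers to each item in the top and bottom quarters, and keep only
    items answered correctly more often in the top quarter.
    answer_sheets is a list of 0/1 lists, one per pupil."""
    ranked = sorted(answer_sheets, key=sum, reverse=True)
    quarter = max(1, round(len(ranked) / 4))
    top, bottom = ranked[:quarter], ranked[-quarter:]
    kept = []
    for i in range(n_items):
        top_right = sum(paper[i] for paper in top)
        bottom_right = sum(paper[i] for paper in bottom)
        if top_right > bottom_right:  # discard ties and reversals
            kept.append(i)
    return kept
```

For instance, given eight papers on two items of which only the first separates high scorers from low scorers, only that item survives the tally.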

A variation of well-known graphic methods of item analysis has been 
described by Turnbull (32). After dividing the tryout group into sixths 
on the basis of some criterion (such as total test score), the test constructor 
plots on specially prepared graph paper the percent of the examinees in 
each sixth who marked the correct answer to each item. The percent who 
marked each incorrect choice is also plotted. Each sheet of graph paper then 
indicates the discrimination ability of the correct answer and each of the 
incorrect choices for a given item. This process provides revealing data at 


the cost of a great deal of labor. Short cuts are provided, however, and a 
means of estimating an item-criterion correlation coefficient is outlined. The 
method might be useful if large numbers were tested and if IBM equip- 
ment were available. 

Critical ratios have sometimes been suggested as measures of item 
discriminating power: the percents of correct responses in two separate 
criterion groups are obtained, and the differences between these pairs of 
percents are computed for comparison with their own standard errors. The 
resulting critical ratio for each item does indicate how likely it is that a 
given item actually differentiates between the two criterion groups, but it 
does not usually make conveniently possible a comparison of the amount 
of discriminating power possessed by a number of items. It fails in this 
respect because its magnitude is in part dependent on the number of in- 
dividuals in the two criterion groups who marked a response to each item. 
In practice, this number may sometimes decrease from item to item in each 
group because some examinees ordinarily fail to reach items toward the end 
of a tryout test. 
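The critical-ratio computation just described may be sketched as follows (illustrative Python; the pooled-proportion standard error used here is one common formulation, which the text does not fix):

```python
from math import sqrt

def critical_ratio(r1, n1, r2, n2):
    """Critical ratio for the difference between the percents of correct
    response in two criterion groups.  r1, r2 are counts of correct
    answers; n1, n2 are the numbers who reached the item in each group."""
    p1, p2 = r1 / n1, r2 / n2
    pooled = (r1 + r2) / (n1 + n2)  # pooled proportion for the SE
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se
```

Note that the ratio shrinks as n1 and n2 shrink, which is why, as the text observes, it cannot serve to compare items reached by different numbers of examinees.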

Chi square has been proposed as a measure of item discrimination by 
Guilford and others (16). In an unpublished study, Cureton has sug-
gested using a variation of the chi-square test for the same purpose. But, 
as the latter has pointed out, critical-ratio and chi-square tests are applicable 
only to large samples and to item choices that are in the middle of the 
range of attractiveness to the examinees. To circumvent this limitation, 
Cureton has proposed the use of a chi test. However, all of these tests 
suffer from the limitation mentioned in connection with the use of the 
critical ratio as a measure of item discrimination; namely, that they do not 
usually make conveniently available a comparison of the amount of dis- 
criminating power of a group of items. On the other hand, indices that 
permit easy comparison of the amount of discriminating power of a group 
of items do not lend themselves to exact tests of significance. 

The chi test proposed for use by Cureton is designed to determine, at 
designated levels of confidence, whether a sample drawn at random from
the population in which the correct answer to the item is marked by equal 
proportions of the two criterion groups will have the proportion of the 
high-scoring criterion group marking the correct answer as large as or 
larger than the proportion of the low-scoring criterion group marking the 
correct answer. Inasmuch as this test is not widely known and is applicable 
to a greater degree than most others when the samples are small, it is out- 
lined briefly here. The chi test involves no assumptions regarding the shape 
of the distribution of the traits measured by either the criterion or the item scores.


Chi is computed as follows if the number of examinees in the high-
scoring group who mark the item correctly is greater than the number in
the low-scoring group who mark it correctly:

Chi = (Rh − Rl − 1) / √[Rt(1 − Rt/(Nt − NRt))],  (7)

Chi is computed as follows if the number of examinees in the high-scoring
group who mark the item correctly is smaller than the number in the low-
scoring group who mark it correctly:

Chi = (Rh − Rl + 1) / √[Rt(1 − Rt/(Nt − NRt))],  (8)

where Rh = the number of examinees in the high-scoring group who
mark the item correctly,
Rl = the number of examinees in the low-scoring group who
mark the item correctly,

Rt = Rh + Rl,

Nt = the number of examinees in the high-scoring and low-scor-
ing groups,
NRt = the number of examinees in the high-scoring and low-
scoring groups who do not reach the item in the time limit.

For reasons to be explained later in this section, the high-scoring and 
low-scoring groups should each constitute about 27 percent of a group for 
which continuous criterion scores are available. In any case, the two groups 
should be of equal size to make use of the significance levels shown in 
Table 6. This table is an abridgement made by Cureton of Table 8 in 
Statistical Tables for Biological, Agricultural, and Medical Research (100).
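As a computational sketch, the chi test may be rendered as follows. This is illustrative Python assuming the continuity-corrected form given for formulas (7) and (8), which are only partly legible in this copy; treat the exact numerator and denominator as assumptions:

```python
from math import sqrt

def chi_test(rh, rl, nt, nrt):
    """Cureton-style chi for a single item.  rh, rl are the numbers of
    correct answers in the high- and low-scoring groups; nt is the size
    of the two groups combined; nrt the number not reaching the item.
    Assumed form: (Rh - Rl -/+ 1) / sqrt(Rt(1 - Rt/(Nt - NRt)))."""
    rt = rh + rl
    correction = -1 if rh > rl else 1  # continuity correction
    return (rh - rl + correction) / sqrt(rt * (1 - rt / (nt - nrt)))
```

The resulting value is referred to the tabled significance levels for the appropriate sample size.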

Among statistics proposed for expressing item discriminating power in 
the form of relationships, the biserial product-moment r (sometimes called 
the point biserial r), the biserial r, the tetrachoric r, and the phi coefficient 
have been suggested. The choice among these depends partly on the pur- 
pose for which the test and the item analysis data are to be used, partly 
on the convenience with which each statistic serves that purpose, and partly 
on the ease and economy of computation required by the practical circum- 

If the test constructor is primarily interested in constructing a predictor 
test for use with a group of examinees having essentially the same level 
and distribution of ability as the group used for item analysis purposes, the 
biserial product-moment r may be used when the criterion is a continuous 




Table 6

Values of Chi at Various Significance Levels for
Certain Sample Sizes

[The body of this table is not legible in this copy. It tabulates, against sample size (up to "101 and over"), the value of chi at the .05 and .01 significance levels, with separate entries according to whether formula (7) or formula (8) is used.]

variable and will be used as such. This statistic would be especially desir- 
able in the rather unusual case when the item-criterion relationships were 
to be used to obtain multiple regression weights for each item. Formulas 
for computing biserial product-moment coefficients and their variance errors 
are presented by Kelley {101, pp. 370-73). A convenient computing 
formula is as follows: 



Mr - M(N^-NR r) / pR Q 

Mr = the mean criterion score of examinees who 
mark the item correctly, 

the mean criterion score of examinees who 
reach the item in the time limit, 



the standard deviation of criterion scores of ex- 
aminees who reach the item in the time limit, 
pii = the number of examinees who mark the item 
correctly divided by the number who reach the 
item in the time limit. 
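The computation of formula (9) may be sketched in Python as follows (illustrative; examinees who do not reach the item are assumed to have been dropped before the call, and the function name is ours):

```python
from math import sqrt
from statistics import mean, pstdev

def point_biserial(criterion, right):
    """Biserial product-moment (point-biserial) r between a 0/1 item
    and a continuous criterion.  Both lists cover only examinees who
    reached the item; not-reached cases are dropped beforehand."""
    m_all = mean(criterion)
    sd = pstdev(criterion)
    m_right = mean(c for c, r in zip(criterion, right) if r)
    p = sum(right) / len(right)  # proportion marking the item correctly
    return (m_right - m_all) / sd * sqrt(p / (1 - p))
```

The result is identical to the ordinary product-moment r between the 0/1 item scores and the criterion scores.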

If the criterion variable is a natural dichotomy and must be used as 
such, an acceptable method of expressing the item-criterion relationship 
is the phi coefficient when the group with which the test is to be used is 
essentially the same with respect to level and distribution of ability as the 
group used for item analysis purposes and when the point of dichotomy 
in the criterion variable remains constant in successive groups in which 


the test is being used for prediction purposes. When these extremely un- 
usual conditions are met, the phi coefficient is appropriate because it is a 
rigorous product-moment r subject to precise tests of significance and suit- 
able for use in computing multiple regression weights. Computational 
routines for obtaining phi coefficients and their variance errors are illus- 
trated by Kelley (101, pp. 379-82).
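The phi coefficient itself is a simple computation from the fourfold table of item score against dichotomous criterion (illustrative Python; the cell lettering is ours):

```python
from math import sqrt

def phi_coefficient(a, b, c, d):
    """Phi for a fourfold table:
           right  wrong
    high     a      b
    low      c      d
    A rigorous product-moment r for two genuine dichotomies."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
```

Because it is a true product-moment r, the coefficient is subject to exact tests of significance under the conditions described in the text.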

A table from which phi coefficients may be read directly has been pre- 
pared by Jurgensen (33) for use in the special case when the number of
examinees in each of the dichotomous criterion groups is the same. This 
condition is rarely met precisely in item analysis work because some ex- 
aminees usually fail to reach certain items in the time limit, but Jur- 
gensen's table might be of considerable value when every examinee (or 
nearly every one) has marked an answer to every item. 

An abac published by Guilford (16) makes possible estimation of the 
phi coefficient from percents of correct responses made by examinees in 
high-scoring and low-scoring groups of a distribution of criterion scores. 
Unless these groups consist of the highest and lowest 50 percent of the 
sample, this procedure leads to coefficients that take values having no 
analytical relationship to the product-moment r and that are not subject 
to precise tests of significance. Thus, they do not possess the two properties 
that justify the use of the phi coefficient under the circumstances described 
previously, but they are economical to compute. 

To avoid some of the disadvantages of the phi coefficient and the biserial 
product-moment coefficient, the biserial and tetrachoric correlation coeffi- 
cients have been suggested as indices of item discriminating power. Both 
of these require the assumption that a normally distributed underlying 
variable has been forced into a dichotomy. Computation of the biserial r 
demands this assumption only in the case of one variable, while computa- 
tion of the tetrachoric r requires acceptance of this assumption for both 
variables. It is clear that when an assumption of normality is made without 
evidence to support it, any tests of the significance of coefficients computed 
on the basis of it may be misleading. Only when cogent reasons for doing 
so are presented should the test constructor abandon product-moment co- 
efficients for estimates of relationships expected in parent distributions 
assumed to be normally distributed. 

As it happens, such reasons often present themselves. It is sometimes 
difficult or even impossible to obtain for experimental purposes samples 
of examinees in which the level and distribution of ability are reasonably 
similar to those in the groups with which the final form of the test is to be 
used. Sometimes a test is to be used in several school grades. In this case 


the selection of items for the final form should theoretically take into 
account separately the discriminating power of each item among pupils 
in each grade in which the test is to be used. Given this situation, the 
problem of item selection can become very complicated, and test editors 
often seek to simplify it by using for each item one discrimination index 
that is unaffected by the average level of ability of the pupils and that is 
based on a weighted proportion of pupils at each of several grade levels. 

A different kind of consideration that often leads the test constructor to 
prefer an index of item discriminating ability that is essentially unaffected 
by the level of difficulty of each item is the efficiency of measurement that 
may sometimes be obtained by deliberate control of the distribution of item 
difficulty levels when a test is assembled. It is a well-known fact that the 
variance of any single item is at a maximum when it is used in a group of 
examinees in which 50 percent mark it correctly. From this it logically 
follows, as Gulliksen has demonstrated, that test-score reliability and 
variance are maximized when every item is of 50 percent difficulty regard- 
less of the level of item intercorrelation (73). Unfortunately, many test 
constructors have uncritically accepted high over-all reliability as the pri- 
mary goal in test construction. Sometimes it should be so accepted, but 
often the over-all reliability coefficient of a test is merely an interesting 
but irrelevant statistic. Ordinarily, the primary goal in test construction is 
to maximize the number of discriminations among all the examinees or 
between such groups of examinees as the test administrator designates. 
Methods of selecting items to approximate this goal will be discussed in
some detail in a later section of this chapter. 

To provide an index of discriminating ability that is essentially un- 
affected by differences in the percent of testees answering correctly items 
scored "right" or "wrong," the biserial r may be employed when the 
criterion variable is continuous. One of the computing formulas for this 
statistic, which is given by Dunlap (10), may be written in the notation
of this chapter as follows:

rbis = [(Mr − M(Nt−NRt)) / σ(Nt−NRt)] (pR/z),  (10)

where z = the ordinate in the unit normal distribution which divides
the area under the curve into the proportions p and q.
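The computation of biserial r, and of its variance error when the true value is zero, may be sketched as follows (illustrative Python; statistics.NormalDist supplies the normal ordinate z, and the function names are ours):

```python
from statistics import NormalDist, mean, pstdev

def biserial_r(criterion, right):
    """Biserial r of a 0/1 item with a continuous criterion, over
    examinees who reach the item: (M_right - M_all)/sigma * p/z,
    where z is the normal ordinate at the p:q split."""
    nd = NormalDist()
    m_all = mean(criterion)
    sd = pstdev(criterion)
    m_right = mean(c for c, r in zip(criterion, right) if r)
    p = sum(right) / len(right)
    z = nd.pdf(nd.inv_cdf(p))  # ordinate dividing the curve into p and q
    return (m_right - m_all) / sd * p / z

def var_zero_rbis(p, n_reached):
    """Variance error of biserial r when the true value is zero:
    pq / (z^2 * (Nt - NRt))."""
    nd = NormalDist()
    z = nd.pdf(nd.inv_cdf(p))
    return p * (1 - p) / (z * z * n_reached)
```

Unlike the point biserial, biserial r can exceed 1.00 when the normality assumption is badly violated, which is worth checking in practice.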

If the criterion scores of the examinees have been corrected for chance 
by means of an appropriate formula, the procedure described above will 
provide correlation coefficients based on corrected scores and thus com- 


parable in this respect to other types of discrimination indices to be de-
scribed later in this chapter. 

The variance error of biserial r may be approximated when an item is 
marked correctly by from 5 percent to 95 percent of the testees who read 
it by the following formula (which assumes that the distribution under- 
lying the dichotomy is exactly normal and that the sample is large) : 

^'-bis '■ 

(^ -,..)■ 

Nt - NRt 


JMt - I\Kt 
where q = 1 ~ p. 

It is always desirable to make use of item analysis data for which vari- 
ance errors may be computed, yet, in practice, tests of significance may often 
be of little utility because the test constructor ordinarily has to make use of 
the best items he has even if the relationships of some of them with the 
criterion variable are not significantly different from zero. When a test 
of significance can be utilized, however, the variance error of zero rbis
is often found most useful. Formula (11) may be simplified to compute
the statistic:

σ²zero rbis = pq / [z²(Nt − NRt)].  (12)

If the criterion to be used for item analysis purposes is a natural 
dichotomy, we cannot use biserial r as an index of item validity for items 
scored only "right" or "wrong." A situation of this kind arises when we 
wish to correlate scores on each of a set of items with a dichotomous cri- 
terion variable such as "graduated from college with a degree" and "did 
not graduate from college with a degree," or "completed pilot training and 
received his wings" and "did not complete pilot training and receive his 
wings." In this situation, if we want to obtain an index of item validity 
that is comparable to the biserial r, the tetrachoric correlation coefficient 
should be employed. Its computation assumes that the distributions under- 
lying the two dichotomies are exactly normal. Needless to say, these as- 
sumptions should never be made unless there is good reason for believing 
them to be true, or so nearly true that the data obtained by making them 
will not lose their serviceability. 

The computation of tetrachoric coefficients by formula is so laborious 
that it is never attempted for practical purposes when a large number of 
coefficients is required. Tables to facilitate the computation of tetrachoric 


r have been available for many years (103, Tables 29-30) and, more
recently, computing diagrams have become available for the same pur-
pose (98; 31). The writer recommends these computing diagrams when
tetrachoric r is to be used as a measure of item validity. Goheen and 
Kavruck have recently published a work sheet that may be found con- 
venient for use with Hayes' diagrams (13). The computational routine
for use with the diagrams prepared by Chesire, Saffir, and Thurstone is 
presented here in order to show how the correction for chance and the 
adjustment for examinees who did not reach some items in the time limit 
may be incorporated in the procedure. A fourfold table is first set up with
each of the cells lettered as in the accompanying diagram. 



         Wrong   Right   Total
High       A       B       C
Low        D       E       F
Total      G       H     1.000

Entries in the cells of this table may be filled in by means of the follow-
ing formulas:

B = [Rh − Wh/(ki − 1)] / (Nt − NRh − NRl),  (13)

C = (Nh − NRh) / (Nt − NRh − NRl),  (14)

E = [Rl − Wl/(ki − 1)] / (Nt − NRh − NRl),  (15)

A = C − B,  (16)

D = F − E,  (17)

F = 1 − C,  (18)

G = A + D,  (19)

H = B + E,  (20)

where Rh = the number of examinees in the high criterion group who
mark the item correctly,
Rl = the number of examinees in the low criterion group who
mark the item correctly,
Wh = the number of examinees in the high criterion group who
mark the item incorrectly,
Wl = the number of examinees in the low criterion group who
mark the item incorrectly,
Nt = the number of examinees in both criterion groups, 
Nh = the number of examinees in the high criterion group, 
NRh = the number of examinees in the high criterion group who 

do not reach the item in the time hmit, 
NRi, = the number of examinees in the low criterion group who 
do not reach the item in the time limit, 
ki = the number of choices in the item. 
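The routine is mechanical and may be sketched as follows (illustrative Python; where the copy is illegible, the chance correction W/(ki − 1) in cells B and E is an assumption):

```python
def fourfold_cells(rh, rl, wh, wl, nh, nt, nrh, nrl, k):
    """Cell proportions for entering the tetrachoric computing diagrams,
    with correction for chance and adjustment for examinees who did not
    reach the item.  The chance correction W/(k - 1) in cells B and E is
    an assumption where the source is illegible."""
    reached = nt - nrh - nrl
    B = (rh - wh / (k - 1)) / reached
    C = (nh - nrh) / reached
    E = (rl - wl / (k - 1)) / reached
    F = 1 - C
    A = C - B
    D = F - E
    return {"A": A, "B": B, "C": C, "D": D, "E": E, "F": F,
            "G": A + D, "H": B + E}
```

As a check on the arithmetic, G + H should equal 1.000, the grand total of the table.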

The appropriate cells in each fourfold table are used to enter the com- 
puting diagrams, and the tetrachoric r is obtained with a minimum of 
labor. The entry in cell H may be multiplied by 100 (to convert it into a 
percent) and used as an index of item difficulty, for it will be comparable to 
the adjusted percent provided by formula (1).*

The variance error of a tetrachoric correlation coefficient can only be 
approximated, but there are occasions on which the test constructor is 
willing to utilize even a dubious variance error to set some sort of objective 
standard for rejecting items that do not appear to possess significant rela- 
tionships with the criterion (at a specified level of confidence). The fol- 
lowing formula may be employed to estimate the variance error of a tetra- 
choric r when the true value is zero : 

σ²zero rtet = (pq p′q′) / [z²z′²(Nt − NRt)],  (21)

where p = the proportion of those examinees that reach the item in the
time limit who are in the high criterion group,

q = 1 − p,

p′ = [Rt − Wt/(ki − 1)] / (Nt − NRt),

q′ = 1 − p′,

z = the ordinate in the unit normal distribution which divides the
area under the curve into the proportions p and q,

z′ = the ordinate in the unit normal distribution which divides the
area under the curve into the proportions p′ and q′,

* If desired, these adjusted percents can be immediately transformed into corresponding
values on the scale of difficulty indices proposed by Davis (7, Table 4). 


Nt = the number of examinees in the sample,
NRt = the number of examinees in the sample who do not reach the
item in the time limit,
Rt = the number of examinees in the sample who mark the item
correctly,
Wt = the number of examinees in the sample who mark the item
incorrectly,
ki = the number of choices in the item.
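Formula (21) may likewise be sketched (illustrative Python; the chance-corrected p′ is an assumption where the copy is illegible, and the function name is ours):

```python
from statistics import NormalDist

def var_zero_rtet(p, rt, wt, nt, nrt, k):
    """Variance error of tetrachoric r when the true value is zero:
    p*q*p'*q' / (z^2 * z'^2 * (Nt - NRt)).  The chance-corrected
    p' = (Rt - Wt/(k - 1)) / (Nt - NRt) is an assumption where the
    source is illegible."""
    nd = NormalDist()
    q = 1 - p
    p_prime = (rt - wt / (k - 1)) / (nt - nrt)
    q_prime = 1 - p_prime
    z = nd.pdf(nd.inv_cdf(p))
    z_prime = nd.pdf(nd.inv_cdf(p_prime))
    return (p * q * p_prime * q_prime) / (z ** 2 * z_prime ** 2 * (nt - nrt))
```

The square root of this quantity gives the standard error against which an obtained tetrachoric r may be judged, with the reservations about its approximate character noted above.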

An example of a situation in which the use of tetrachoric r as an index 
of item validity is to be preferred is provided by the validation of items 
constructed for use in the Aviation Cadet Qualifying Examination against 
the criterion of passing or failing in pilot training in the Army Air Forces. 
This criterion is a good example of a variable in which the underlying abil- 
ity (to learn to fly an airplane) is probably normally distributed but is 
expressed as a dichotomy. Yet the proportion of cadets who were assigned 
scores of "pass" varied markedly because of administrative considerations 
entirely irrelevant to the psychological factors involved. Scores on each 
multiple-choice item used in the Aviation Cadet Qualifying Examination 
were expressed only as "right" and "wrong," though it is reasonable to 
suppose that a normal distribution of talent underlay the responses to each 
item, especially to items not extremely easy or difficult. For practical rea- 
sons, the items had to be tried out in groups of aviation students already 
selected by previous forms of the test. These students constituted a far 
more able and more homogeneous group than the applicants for cadet 
training with whom the final form of the test was to be used. In this situa- 
tion, the tetrachoric r obtained on the basis of the selected group is clearly 
a better estimate of the correlation of the underlying variables in a sample 
of applicants than the phi coefficient or any rigorous product-moment r. 

A method for estimating tetrachoric correlation coefficients for use as in- 
dices of item discriminating power that is somewhat more economical than 
the method described above for use with the computing diagrams was 
suggested in 1940 (28). This method is not applicable except when either 
the upper and lower 50 percent or the upper and lower 25 percent of the 
testees with respect to the criterion variable can be identified. In other 
words, the method cannot ordinarily be used if the criterion variable is a 
natural dichotomy. 

Vernon has suggested the use of what he calls double tetrachoric co- 
efficients. He has shown that these are more reliable than ordinary tetra- 
choric coefficients, yet may be obtained very rapidly (33).*

* This excellent article by Vernon presents a summary of item analysis techniques with 
some evaluative commentary. 



From the preceding discussion in this chapter, it is apparent that the 
type of item analysis data to be selected for use in constructing a test is 
dependent on the purpose of the test constructor and the kind of basic data 
he can obtain. There is no one type of item analysis data that is best under 
all circumstances. Nevertheless, circumstances that favor the use of biserial 
r as an index of discriminating power appear to be most numerous. These 
circumstances include the fact that the level of distribution of ability is not 
always the same in experimental and consumer groups of examinees, and 
the fact that it is rarely desirable for a test to have its minimum standard 
error of measurement exactly at the raw-score point corresponding to the 
average level of ability in the group tested. 

Unquestionably, biserial correlation coefficients would have been used 
more frequently in the past for item analysis purposes if they were not so 
laborious and expensive to obtain. To meet the need for an economical 
approximation to biserial coefficients, Flanagan, working with Kelley, de- 
vised an ingenious procedure based on the fact that, since the magnitude 
of a correlation coefficient is determined by extreme cases to a much greater 
extent than by cases near the middle of the bivariate surface, an estimate of 
the coefficient may be obtained with a much greater decrease in labor than 
in accuracy by utilizing only the data in the tails of the two distributions. 
This procedure became of practical utility for estimating correlation co- 
efficients in 1931 when Part II of Pearson's Tables for Statisticians and Bio-
metricians (106) was published. Tables 8 and 9 provide the frequencies
on a normal bivariate surface for cells one-tenth of a standard deviation 
square having lower limits of 0.0, ±0.1, ±0.2, . . ., ±2.5 standard devi- 
ations from either or both means. All cells having lower limits of ±2.6 
standard deviations extend to infinity. These frequencies are provided for 
product-moment correlation coefficients at intervals of .05 from —1.00 to 
+ 1.00. 

Given these data, all that remained to make the procedure practical was 
to determine the best proportion of the tails of the criterion distribution to 
be employed and to construct tables for convenient use. Kelley demonstrated 
in 1939 (19) that for items of 50 percent difficulty and low reliability
scored in graduated amounts the optimum proportions are 27 percent 
in each tail of the criterion distribution. Though items are not always 
close to the 50 percent level of difficulty and are rarely scored in other than 
two categories ("right" and "wrong"), Kelley concluded that the upper 
and lower 27 percent of the criterion distribution are ordinarily most satisfactory.

Empirical evidence to support this conclusion was obtained prior to 
1942 in unpublished studies at the Cooperative Test Service; in a report on 


Printed Classification Tests, edited by Guilford and Lacey (49, pp.
30-31), some evidence of this kind has been published. Internal-consistency
item discrimination indices were computed by several methods for 68 
items in a test called Visualization of Maneuvers. Two separate samples 
of 400 aviation cadets were employed. The product-moment correlations 
between indices based on these two samples were obtained to provide a 
measure of the reliability of indices computed by the various methods em- 
ployed. The reliability coefficient for biserial rs is .87, for estimates of the 
biserial r obtained from Flanagan's table it is .87, and for tetrachoric rs 
it is .79. 

The use of Kelley's method assumes that normal distributions of talent 
underlie the dichotomous item response categories of "right" and "wrong" 
and the criterion distribution of which only the highest and lowest 27 
percent are utilized. Rectilinearity of regression of item scores on criterion 
score is also assumed. It is evident, therefore, that the correlation co- 
efficients estimated by means of Flanagan's table are strictly analogous to 
tetrachoric correlation coefficients, though more reliably determined than 
the latter. 

In 1935 Flanagan published an abbreviated table of product-moment 
correlation coefficients (45, Table 11) and in 1936 made available A
Table of the Values of the Product-Moment Coefficient of Correlation in a 
Normal Bivariate Population Corresponding to Given Proportions of Suc- 
cesses. Later, he presented a valuable discussion (12) of the use of co-
efficients obtained by means of this table, which has been widely used to 
obtain item-criterion correlation coefficients. Moreover, Flanagan's table 
has general utility and may be found exceptionally useful in many in- 
stances outside the field of test construction where economical approxima- 
tions to biserial correlations are required. 

The variance errors for coefficients of correlation read from Flanagan's 
table cannot now be determined by analytical means, but it is evident on 
the basis of both theoretical considerations and empirical data that their 
sampling errors are larger than those of biserial rs and smaller than those 
of tetrachoric rs. This may come as a surprise to many research workers 
who may have supposed that the reliability of the resulting data would 
be impaired by eliminating the middle 46 percent of the cases. But Kelley
suggested that the elimination of this group should actually improve the 
reliability over what it would be for tetrachoric coefficients computed on 
the basis of the highest and lowest 50 percent of the same sample, and 
empirical evidence, such as that reported above, confirms this expectation. 

Data concerning the reliability of item-criterion correlation coefficients 


obtained by means of the Flanagan table have been published (43; 44). For
86 items pertaining to information about flying, the writer computed two 
sets of item-test correlation coefficients, using two comparable samples of 
370 aviation students. Each student tried every item. The high-scoring 
and low-scoring groups were chosen on the basis of the total score derived 
from 81 of the 86 items. The correlation of the two sets of item-test 
correlation coefficients proved to be .67.[9] The standard error of measure-
ment in this particular group of item discrimination indices was approxi- 
mately .08. 

In those rare circumstances when the use of the product-moment r, the 
product-moment biserial r, or the phi coefficient as measures of discriminat- 
ing ability is appropriate, the obtained coefficients have certain inherent 
properties of great value.[10] Consequently, the test constructor has no desire
to express them otherwise. However, if the data take the form of biserial 
r's, estimates of biserial r's from Flanagan's table, or tetrachoric r's, the test 
constructor may regard them as indices of item discriminating ability 
essentially unaffected by differences in the level of difficulty of the items or 
in the variability of criterion scores for the group that reaches successive 
items. He may then wish to transform them into more nearly an interval 
scale of discriminating power. 

It is well known that it becomes more and more difficult to raise a 
correlation coefficient by a certain number of hundredths as perfect cor- 
relation is approached. A difference of .05 between correlation coefficients 
of .90 and .95 represents a far greater difference in relationship than a 
difference of .05 between coefficients of .05 and .10. This is not a serious 
limitation for item analysis purposes, but in the case of product-moment 
coefficients it is so easily removed that there is no reason to tolerate it if it 
works any inconvenience. To satisfy the special requirements of test editors 
for an index of discriminating power that is substantially comparable from 
item to item and that is easy to use, the writer has suggested indices that 
constitute a linear function of Fisher's z and range from 0-99, thus 
eliminating decimals. They have properties that permit them legitimately 
to be added, subtracted, and averaged; their variance errors are virtually 
identical regardless of their magnitudes when they are based on samples 
of the same size or essentially the same size; and the units in which they 
are expressed are sufficiently coarse to discourage an impression of ex- 

[9] Mean of set A = .29; SD of set A = .15; mean of set B = .28; SD of set B
= .13; N = 86.

[10] This is not true of the phi coefficient if it has been obtained from an abac such as
Guilford's, unless the high-scoring and low-scoring groups each consist of 50 percent of the sample.


treme precision of measurement, yet fine enough to satisfy ail practical 
requirements of test construction. 
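The transformation described above can be sketched in a few lines of modern code. The scale constant below is illustrative only (it is not the writer's published constant); it merely demonstrates how a linear function of Fisher's z yields whole-number indices in which equal differences represent more nearly equal differences in discriminating power than equal differences in r do.

```python
import math

def fisher_z(r):
    """Fisher's z transformation of a correlation coefficient;
    differences in z are comparable at all levels of correlation."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))

def discrimination_index(r, scale=21.0):
    """A linear function of Fisher's z, rounded to a whole number to
    avoid decimals. The constant 21.0 is an assumed, illustrative
    scale chosen so that workable correlations map into 0-99."""
    return round(scale * fisher_z(r))

# Equal differences in r correspond to very unequal differences in the
# index near the top of the scale:
low_gap = discrimination_index(0.10) - discrimination_index(0.05)
high_gap = discrimination_index(0.95) - discrimination_index(0.90)
```

The gap between indices for r = .90 and r = .95 is several times the gap between indices for r = .05 and r = .10, reflecting the point made in the text about differences near perfect correlation.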

To minimize the labor required for obtaining item analysis data, a chart 
has been prepared which, like the Flanagan table, is entered with percents 
of correct responses obtained by examinees in the lowest 27 percent and 
the highest 27 percent of the criterion-score distribution. Unlike the Flana- 
gan table, however, this chart yields for most usable test items both an 
index of discriminating power and an index of difficulty in one operation.[11]
These indices possess the properties recommended previously in this 
chapter. Their variance errors, like those of the correlation coefficients 
derived from Flanagan's table, cannot now be obtained analytically, but 
it is known that they are more reliably determined than tetrachoric r's and 
less reliably determined than biserial r's. Empirical evidence concerning the 
correlation between two groups of item discrimination indices obtained 
from the Davis item analysis chart was presented by the writer in 
1946 (42). For 86 items the correlation coefficient based on two compara- 
ble samples of 370 testees was .58.[12] When the percents used to enter the
chart were not corrected for chance, the corresponding correlation coefficient 
became .65.[13]

The correlation of the discrimination indices with difficulty indices for
the same items was much lower in sample A when the percents used to enter
the chart were corrected for chance than when the correction was not made;
without the correction, the correlation of the resulting sets of indices in
sample A was .41. These data show that correction for chance greatly reduced
the correlation between these discrimination and difficulty indices and
suggest that when the Davis item analysis chart is used in the way recom-
mended, it yields discrimination and difficulty indices that are essentially
independent.

Unquestionably, improved methods of item analysis will be developed. 
Application of the principles of sequential sampling may provide one of 
these. In fact, Walker has found that this technique does efficiently sepa-
rate, at a designated level of confidence, items that have a significant rela- 
tionship with the criterion from those that do not (36; 37). Information
comparable to that supplied by the more familiar critical-ratio, chi-square, 
or chi tests is provided, but the sequential-sampling method is said greatly 
to reduce the labor involved if specially prepared tables are used. Some 
of the latter have already been made available. 
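The technique referred to is Wald's sequential probability ratio test. As an illustrative sketch only (Walker's special tables are not reproduced here, and the hypothesis rates below are assumed for the example), the following routine examines a stream of right/wrong observations and stops as soon as the accumulated likelihood ratio crosses a decision boundary.

```python
import math

def sprt(observations, p0, p1, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test on a stream of 0/1
    observations: H0 says the success rate is p0, H1 says it is p1.
    Sampling stops as soon as a boundary is crossed."""
    upper = math.log((1.0 - beta) / alpha)   # cross this: accept H1
    lower = math.log(beta / (1.0 - alpha))   # cross this: accept H0
    llr = 0.0
    for n, x in enumerate(observations, start=1):
        llr += math.log((p1 if x else 1.0 - p1) / (p0 if x else 1.0 - p0))
        if llr >= upper:
            return "reject H0", n
        if llr <= lower:
            return "accept H0", n
    return "continue sampling", len(observations)

# A run of successes quickly favors the higher rate, after far fewer
# observations than a fixed-sample test would require:
decision, n_used = sprt([1] * 20, p0=0.5, p1=0.9)
```

The saving in labor comes from early stopping: a decisive run of observations ends the test after only a handful of cases.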

[11] Computational procedures for use with this chart have been provided in great detail
(7, chap. 5).

[12] Mean (sample A) = 25.14; mean (sample B) = 24.10; SD (sample A) = 14.30;
SD (sample B) = 11.75; N = 86.

[13] Mean (sample A) = 18.73; mean (sample B) = 18.03; SD (sample A) = 10.14;
SD (sample B) = 7.85; N = 86.


The use of factorial methods for item analysis purposes awaits the de- 
velopment of electronic computers capable of extracting factors from large 
matrices of intercorrelations with great rapidity. As soon as these instru-
ments become widely available, it may be that items will be selected on the
basis of their factor loadings in the principal components of the matrix of 
intercorrelations obtained by intercorrelating all the items in a test. Many 
technical and practical problems will have to be solved, however, before 
factorial methods will yield meaningful data for item analysis purposes. If 
product-moment correlation coefficients are used in the matrix to be 
analyzed, the varying difficulties of the items will create trouble, as indi- 
cated by Lawley (80). If tetrachoric coefficients are used, their unreliability
and lack of susceptibility to precise tests of significance will raise awkward 
problems. In any event, we can look forward with interest to developments 
in the field of item analysis that even now we can see taking shape. 

Factors Affecting the Interpretation of Item 
Discrimination Indices

Spurious Correlation in Internal-Consistency Data 

When the total score on a test is used as the criterion variable for evalu- 
ating each individual item in the test, the critical ratios, chi squares, chi's, 
or item-test correlation coefficients that are computed are spuriously high. 
This results from the fact that each item is part of the criterion with which 
it is correlated or compared. One way of eliminating this overlapping is to
score the test as many times as there are items in it and to use as the cri- 
terion for evaluating each item a total score from which it has been ex- 
cluded. However, this procedure is impractical because it requires an ex- 
penditure of labor out of all proportion to the benefit derived from it. 
Unfortunately, there is no statistical technique by which the effect of the 
overlapping can be accurately removed with an increase in computational 
labor small enough to justify the resulting benefit.[14] The best that can be
done is to indicate what the order of magnitude of the spurious correlation 
is likely to be and point out that the relative magnitudes of the item dis- 
crimination indices are affected less than their absolute magnitudes. 

The smaller the number of items in the total score used as a criterion 
variable, the larger will be the spurious item-criterion relationship. Conse-
quently, items should be tried out in large groups when internal-consistency 
item analysis data are to be obtained. It is evident that the length of the 
tryout test is usually an important consideration in the interpretation of 
internal-consistency item discrimination indices. 

[14] To remove the spurious element from biserial coefficients, see (38, equation 4).


For discrimination indices expressed as correlation coefficients, the mag-
nitude of the spurious element can be calculated precisely for two limiting
conditions (7). If a set of items are all of the same level of difficulty (what-
ever that may be), it can be shown that when the intercorrelations are all
zero, the spurious item-criterion correlation coefficient will be 1/√n, where
n equals the number of items. For a 100-item test, therefore, the spurious
correlation will be .10. If the item intercorrelations are all unity, the item- 
criterion correlation will necessarily be unity and no spurious element will 
be present. As this illustration suggests, the spurious element in the item- 
criterion coefficient decreases as the average item intercorrelation increases. 
Fortunately, the size of the spurious element drops rapidly with the first 
few hundredths of item intercorrelation. One other element that is impor- 
tant in determining the amount of spurious correlation is the difficulty of 
the item. The spurious element is maximized when an item is of 50 percent 
difficulty; the discrimination indices for easy and hard items need not be 
discounted so much as those for items of median difficulty. 
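The 1/√n figure for uncorrelated items of equal difficulty is easy to verify by simulation. The sketch below is illustrative only (the sample sizes are arbitrary): it generates 100 independent items of 50 percent difficulty, then correlates one item with the total score of which it is a part. The result hovers near .10, as the formula predicts.

```python
import math
import random

def pearson_r(x, y):
    """Product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

random.seed(7)
n_items, n_examinees = 100, 5000
# Independent items, all of 50 percent difficulty.
responses = [[random.random() < 0.5 for _ in range(n_items)]
             for _ in range(n_examinees)]
item0 = [int(row[0]) for row in responses]
total = [sum(row) for row in responses]
r_spurious = pearson_r(item0, total)   # approximately 1/sqrt(100) = .10
```

The spurious correlation appears even though the item, by construction, measures nothing in common with the other 99 items.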

Position of an Item in the Tryout Form 

Generous time limits should always be provided so that all, or almost 
all, of the examinees will have a chance to try every item. However, it is 
sometimes impossible to arrange administrative conditions to make this 
readily possible. Then, items near the end of the tryout test may not be 
reached by some examinees. This leads to a reduction in the size of the 
sample on which item analysis data for such items can be based; this re- 
duction leads to an increase in the variance errors of difficulty and dis- 
crimination indices for these items. Thus, in some instances the reliability 
of the discrimination index for an item depends on the position of the 
item in the try-out test. The test constructor must keep this point in mind 
because if the number in the sample that has not read successive items 
becomes sufficiently large, the item analysis data that can be computed may 
become so unreliable as to be not worth obtaining. 

Even more serious than this progressive reduction in the reliability of 
item analysis data for successive items in a highly speeded tryout test is 
the systematic bias that may characterize the difficulty and discrimination 
indices, particularly the latter. Data biased in this way can be gravely mis- 
leading except to the most sophisticated test constructors, and even for 
the latter they are inconvenient to use. This bias may be caused by a tend- 
ency which the writer has observed for some examinees to rush through 
a test, marking items almost at random when they come to the more diffi- 


cult items. In general, these examinees tend to be those of low rather than 
of high ability. The result is that the group of examinees who actually 
attempt items near the end of a highly speeded test consists of the ex- 
ceedingly able, well-informed examinees who work rapidly and accurately 
and the dull, poorly informed examinees who rush through the items, 
marking answers almost at random. Thus, the distribution of criterion 
scores for the group that attempts items near the end of a highly speeded 
test tends to become platykurtic. For this reason, item-criterion correlation 
coefficients tend to become inflated for these items. Both Wesman and 
Mollenkopf have published data showing that this does occur (66; 61).
Mollenkopf writes, 

The evidence of this study indicates that the best persons (in terms of 
criterion scores) attempt the most items, and usually are successful. The 
even spread of scores down into the low range indicates that some persons 
of low and average ability also mark late items. Two possible explanations 
are suggested why these persons of low ability thus hurry through the test: 
(a) they do not recognize their own limitations, believing themselves to be
more capable than they really are, and (b) they are test-wise individuals
who hope to better their scores by attempting a great many items. 

This interpretation agrees with the writer's and suggests another reason 
for telling examinees not to guess wildly and for using an appropriate cor- 
rection for chance in the scoring of tryout tests. 

It has sometimes been proposed, when product-moment or biserial cor- 
relation coefficients have been computed between item and criterion scores, 
that a correction be made for any alteration in range of criterion scores for 
examinees who actually reach items near the end of a speeded tryout test. If 
a test constructor decides to apply such a correction, he should be careful 
to make use of an appropriate procedure. The basic formula derived by 
Pearson (and discussed by Kelley [101, p. 430]) for correcting a product-
moment correlation coefficient for restriction of range on the basis of one 
of the correlated variables is not applicable to biserial correlation coefficients 
of the type most commonly used for item analysis purposes. A procedure 
developed in 1946 by Gillman and Goode (48) may be found reasonably
satisfactory for this purpose.
If the writer's recommendations for estimating the number of examinees 
who fail to reach each successive item and for using this information in 
computing difficulty and discrimination indices are followed, a small 
progressive reduction in the number of cases on which the item-analysis 
data are based will not seriously bias the indices. If the total number in 
the sample is used or if a distinction is not made between examinees who 


have read an item and refrained from marking it and those who have
failed to reach the item in the time limit, the indices can be strongly biased 
merely because of the position of the item in the tryout form. 

An example is provided by a five-choice item that was not reached 
in the time limit by approximately one-quarter of the sample. The data for 
the correct answer were:

                        Number in Highest    Number in Lowest
                            27 Percent           27 Percent

Right                           95                   19
Wrong                            2                    6
Omit                             2                   28
Not reached                      1                   47

Total                          100                  100

Item analysis data obtained on the basis of three different formulas fol-
low:

                                        Computing Formula

                                 R/(N - NR)    R/(R + W)      R/N

Item-test correlation
coefficient read from
Flanagan table                      .69           .50         .76

Difficulty index (average
of percents in high
and low groups)                      64            87          57
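The three computing formulas can be applied to the tallies above in a few lines. This is an illustrative sketch; the arguments are simply the writer's R (right), W (wrong), O (omit), and NR (not reached) tallies for a group of N examinees.

```python
def percents_right(R, W, O, NR, N=100):
    """Percent right under the three computing formulas:
    R/(N - NR), R/(R + W), and R/N."""
    return {
        "R/(N - NR)": 100.0 * R / (N - NR),
        "R/(R + W)": 100.0 * R / (R + W),
        "R/N": 100.0 * R / N,
    }

# Tallies for the five-choice item discussed in the text:
high = percents_right(R=95, W=2, O=2, NR=1)
low = percents_right(R=19, W=6, O=28, NR=47)
```

The low group illustrates the source of the bias: treating the 47 not-reached cases as wrong (R/N) yields 19 percent right, while excluding them (R/(N - NR)) yields about 36 percent.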

Criterion-Score Reliability 

Any lack of reliability in the criterion variable affects all item-criterion 
coefficients in the same manner (though not by the same amount). Correc- 
tion of these coefficients for lack of perfect reliability in the criterion will 
have no effect on their rank order, but it will render them all somewhat
more unreliable. It will also impair estimates of the accuracy of prediction 
achieved by using them in combination. If, for some reason, the reliability 
of the criterion used for evaluating items in the tryout form of the test were 
altered considerably (without any change in the mental or physical skills 
measured) when the items were used for practical purposes, estimates of 
their prediction accuracy under the changed conditions could be secured 
by means of the following formula: 

                    R_ic = r_ic √(R_cc / r_cc)                    (22)

where R_ic = the item-criterion correlation coefficient when the criterion
             reliability coefficient is R_cc,
      r_cc = the original criterion reliability coefficient,
      R_cc = the altered criterion reliability coefficient,
      r_ic = the item-criterion correlation coefficient when the criterion
             reliability coefficient is r_cc.
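A worked example of formula (22), with illustrative numbers: if an item correlates .30 with a criterion whose reliability is .81, and the criterion's reliability in the new situation drops to .64, the expected item-criterion correlation becomes .30 √(.64/.81), or about .27.

```python
import math

def adjusted_item_criterion_r(r_ic, r_cc, R_cc):
    """Formula (22): the item-criterion correlation expected when the
    criterion reliability changes from r_cc to R_cc."""
    return r_ic * math.sqrt(R_cc / r_cc)

R_ic = adjusted_item_criterion_r(r_ic=0.30, r_cc=0.81, R_cc=0.64)
```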

The Use of Item Analysis Data 

Revision of Test Items 
The previous discussion of item analysis data in this chapter has con- 
cerned the tabulation of right answers, wrong answers, and the number of 
examinees who did not reach each item. These tabulations are indeed suffi- 
cient to provide data for computing difficulty and discrimination indices, but 
they must be supplemented with choice-by-choice tabulations if the full 
value of item analysis techniques is to be realized. The exact form of these 
choice-by-choice tabulations depends on the type of statistics being used for 
difficulty and discrimination indices. Following is a choice-by-choice tabu- 
lation of the kind of data recommended by the writer. The item used for 
illustrative purposes is based on a reading passage that is not reproduced here.

                                        Percent Selecting Choice

                                      High 27%            Low 27%
                                     (N = 100)           (N = 100)

13. Which one of the following
    words does not seem to go
    with the tone of the passage?

    A  Monopoly (line 2)
    B  Renaissance (line 10)
    C  Lucky (line 12)
    D  Dozen (line 14)
   *E  Jackpot (line 22)                 49                  26

    Omitted item
    Did not reach item

* Correct answer.

It will be noted that 49 percent of the high group selected the correct an- 
swer to this item, choice E, while only 26 percent of the low group selected 
it. After these two percents have been corrected for chance success, they are 
used to enter the Flanagan table and to obtain the estimated percent of 
examinees in the entire sample who know the answer to the item. For pur- 
poses of obtaining information that will permit revising the item, we are 
as much interested in the percents in the high and low groups that select 
the incorrect choices as we are in the percents that select the correct choice. 
Note that incorrect choices A, B, and D are more attractive to the examinees 
in the low group than in the high group. This is just what the item writer 
hoped would be the case. Incorrect choice C, however, does not work so 


well. A slightly larger percent of the high group than of the low group 
select it. Consideration of the reasons for this indicates the probability that
examinees who are high in general reading ability regard "lucky" as a col- 
loquial word. It isn't, of course — certainly not in the same way that "jack- 
pot" is — but "lucky" does not serve as a discriminating incorrect choice, so 
the item would be improved by replacing "lucky" with a word that would 
constitute a more discriminating incorrect choice. Psychological insight and 
ingenuity would have to be exercised to secure a replacement. Careful con- 
sideration of the item analysis data and its relation to the passage on which 
the item is based would be required for this purpose. Comments of expert 
critics should be consulted to make sure that some previously undetected 
ambiguity in the item is not misleading the examinees in connection with 
choice C. 
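A choice-by-choice tabulation of this kind is simple to mechanize. The sketch below uses a hypothetical record format (one choice letter, or "omit" or "not reached," per examinee) and fabricated responses; it merely shows how the percents in the high and low criterion groups would be counted.

```python
from collections import Counter

def choice_tabulation(high_answers, low_answers, choices="ABCDE"):
    """high_answers, low_answers: lists giving, for each examinee, the
    choice letter marked, or 'omit' or 'not reached'. Returns the
    percent of each group selecting each response."""
    rows = list(choices) + ["omit", "not reached"]
    def percents(answers):
        counts = Counter(answers)
        return {r: 100.0 * counts[r] / len(answers) for r in rows}
    return {"high": percents(high_answers), "low": percents(low_answers)}

# Fabricated responses, ten examinees per group:
high = ["E"] * 5 + ["C"] * 2 + ["A", "B", "omit"]
low = ["E"] * 3 + ["A"] * 3 + ["B", "D", "omit", "not reached"]
tab = choice_tabulation(high, low)
```

Comparing the two dictionaries row by row reveals at a glance which incorrect choices draw more of the low group than of the high group, and which draw virtually no one.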

Inspection of choice-by-choice data of the kind illustrated above will 
ordinarily reveal many incorrect choices that are discriminative in the wrong 
direction and many that attract virtually no examinees. The former operate 
to destroy an item's discriminating power while the latter are nonfunctioning 
and may waste space and reading time. When a test constructor is provided 
with data of these kinds, he may delete choices that are grossly invalid and 
nonfunctioning. The efficiency of measurement of items that have been 
pruned in this way may be considerably improved. 

Ordinarily, however, the test constructor will not be content to make so 
limited a use of the revealing data provided by a choice-by-choice item 
analysis. Guided by the hints offered in the data concerning the mental proc- 
esses of the examinees, he can often replace invalid and nonfunctioning 
choices with a good chance of improving the efficiency of measurement of 
the item. Whether he has succeeded can be determined only by administering 
the revised items to samples like those used for the initial tryout. The 
expense and labor involved in tryouts or item analyses may make more than 
one tryout impracticable. 

Occasionally, an incorrect choice that is markedly discriminative in the 
wrong direction represents a concept that cannot be removed without de- 
stroying the whole point of the item. This situation can arise as a result of 
testing a point about which considerable misinformation has been circu- 
lated. In the writer's opinion, if revision of an item involves destroying its 
point, the item either should be discarded or should be used without re- 
vision in spite of its probable lack of efficiency. An item should never be 
subjected to revision that destroys its point. Items that are not discrimina- 
tive in the entire tryout sample will often be found capable of discriminating 
between the very best of the examinees and all of the others or between 



the least capable and all of the others. Such items can be highly useful for 
specialized purposes. 

A second illustration of the improvement in test items that may be ob- 
tained by editorial revision based on item analysis data is provided by the 
data in Table 7. Here we have detailed information showing the validity 
of each choice in an item before and after revision. In this case, the cri- 
terion was success or failure in learning to fly an airplane well enough 
to graduate from elementary flying training in the Army Air Forces. Note 

Table 7
Item Analysis Data for a Test Item before and after Revision

Original Item*                                Revised Item*
(Percent of Graduates and Percent of          (Percent of Graduates and Percent of
Eliminees selecting each choice; N = 84)      Eliminees selecting each choice)

The valve cup is                              The valve cup is
probably made of                              probably made of
    A  spring steel.                              A  celluloid.
   *B  composition.                              *B  composition.
    C  cork.                                      C  cork.
    D  tin.                                       D  tin.
    E  bakelite.                                  E  bakelite.
    Did not reach                                 Did not reach

Percent answering correctly = 49              Percent answering correctly = 61
rtet with criterion = .00                     rtet with criterion = .10

* Both original and revised items were based on the same diagram, which is not reproduced here. The asterisk
in front of choice B indicates that it was the correct answer to each item.

that, in the original item, choices A and B attracted virtually all of the ex- 
aminees and that the percentages of graduates and eliminees selecting 
choice B (the correct answer) were the same. Hence, the item had no 
correlation with the criterion. In an effort to improve the item, choice A 
was changed from "spring steel" to "celluloid." The latter constituted a 
far less attractive choice to similar examinees in the second tryout group. 
The distribution of answers changed in such a way that the item became 
slightly easier and displayed some validity. It cannot be said that the change 
was strikingly successful, but it was accompanied by an increase in the 
item's validity coefficient.

Needless to say, these data should not be considered as proof of any- 
thing. Too few cases are involved for that. But they do illustrate one im- 
portant use of item analysis data. Test constructors accustomed to dealing 
with internal-consistency item analysis data may find the item-criterion 


correlation of .10 very low compared with the values of .40 to .60 that 
they commonly obtain. It must be remembered, however, that such high 
values are rarely encountered when the criterion is not the total test score, 
and that they are almost never encountered when the criterion is a realistic 
one — such as performance in a course of training or a job. 

It is interesting that many invalid distracters are found in items that 
have been carefully edited and checked by subject-matter experts. This 
emphasizes the well-known fact that because a distracter is discriminative 
in the wrong direction we cannot conclude that it is too nearly correct from 
a factual point of view. Conversely, the fact that an incorrect choice is too 
nearly a correct answer does not necessarily mean that it will turn out to 
be discriminative in the wrong direction when it is subjected to item 
analysis. Item analysis techniques cannot alone be relied upon to detect 
errors and ambiguities; expert criticism and editing are indispensable in 
test construction. The full value of item analysis techniques cannot be 
realized unless criticisms of the items by recognized authorities are available 
for reference. 

The Relationship of Item Difficulty to 
Item Discriminating Power 

The relationships of item difficulty and item discriminating power are 
rather complicated and little understood, but, since a thorough compre- 
hension of their relationships is basic to an intelligent selection of test 
items, it is important that an explanation of the matter be presented here. 
The magnitude of a product-moment item-criterion correlation coefficient 
is dependent on two separate elements: the underlying relationship of 
the variables measured by the item and the criterion, and the number of 
discriminations the item can make among the members of a given sample. 
If we assume rectilinearity of regression, the underlying relationships can 
properly be estimated by statistics such as the biserial or tetrachoric correla- 
tions, or by short-cut procedures that lead to the use of the Flanagan table 
or the Davis item analysis chart. The number of possible discriminations
can be determined from the percent of examinees who answer the item correctly.

When a product-moment correlation coefficient is computed between item 
and criterion scores, its value represents a combination of the influences 
produced by the two factors mentioned. As stated previously, its value rises 
as the underlying relationship of the variables increases and as the number 
of discriminations the item can make approaches its maximum at the 
difficulty level where 50 percent of the examinees know the answer. For a 


given degree of underlying relationship between the item and criterion, the 
product-moment correlation is maximized at the 50 percent difficulty level. 
If the item is to be used alone or in combination with others for prediction 
purposes among examinees having the same average and distribution of 
ability as those in the tryout group, no other statistic (based on rectilinear 
regression) will serve so well as the product-moment coefficient. But if we 
wish to generalize the findings obtained in one group to another having a 
different average level of ability or a different point of dichotomy with 
respect to the criterion variable, we can improve on the results of using a 
product-moment coefficient in the original group by using data pertaining 
separately to the two factors entering into the product-moment r; namely, 
the underlying relationship and the difficulty level. If a single test item is 
answered correctly by 50 percent of a group of 100 examinees, it dis- 
criminates between the 50 who pass it and the 50 who fail it. The total 
number of discriminations made by the item is, therefore, 50 X 50, or 
2,500. It is obvious that this is the largest number of discriminations that 
the item can make. Note that if 40 percent of the 100 examinees pass the 
item, it will make 40 X 60, or 2,400, discriminations. If 1 percent of the 
100 examinees pass the item, it will make only 1 X 99, or 99, discrimina- 
tions. These data illustrate the fact that if we were to build a test consisting 
of one item (which we never do), it should be of 50 percent difficulty for 
the sample in which it is to be used if it is to be maximally discriminative. 
But notice that a single-item test is useless if we want to discriminate be- 
tween the ability of two examinees who fail it. If we want to discriminate 
between examinees capable of passing an item at the 30 percent difficulty 
level and those not capable of doing so, we have to employ an item of 30 
percent difficulty level. This item will make only 30 X 70, or 2,100, dis- 
criminations in the sample, but they will be useful discriminations. These 
data indicate that the purpose for which a test is to be used is more im- 
portant in determining the level of difficulty of its component items than 
are other considerations. 
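The arithmetic above can be written down directly: the number of discriminations a single item makes in a group of N examinees is simply the number passing it multiplied by the number failing it.

```python
def discriminations(percent_passing, n_examinees=100):
    """Number of pairwise discriminations a single item makes: each
    examinee who passes is distinguished from each who fails."""
    passing = round(n_examinees * percent_passing / 100)
    return passing * (n_examinees - passing)

# The figures used in the text:
at_50 = discriminations(50)   # 50 x 50 = 2,500 (the maximum)
at_40 = discriminations(40)   # 40 x 60 = 2,400
at_30 = discriminations(30)   # 30 x 70 = 2,100
at_1 = discriminations(1)     #  1 x 99 = 99
```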

Richardson has shown that if one wants to construct a test to differentiate 
examinees above a given level of ability from those below that level of 
ability (without making any distinctions among examinees in the two 
groups), all of the items used in the test should be of a difficulty level 
such that they will be marked correctly by half of the examinees at the level 
of ability represented by the line of demarcation (82). His mathematical
demonstration of the point fits in nicely with the practical illustration pro- 
vided above. This is an important consideration in selecting test items be- 
cause examinations are sometimes built for the purpose of separating 


examinees into two groups. Selection tests used by industrial organizations 
are often used with passing marks. For a test of this kind we should
select items at a certain difficulty level so that the maximum discriminating
power of the test will be exerted at the passing mark. The items will all
be of 50 percent difficulty only when we wish to exclude 50 percent of the
examinees. Items of 50 percent difficulty level have, in general, no peculiar
merit for such purposes.

Most examinations used in schools are not employed mainly to divide 
students into two groups at the passing mark. Ordinarily, distinctions must 
be made among passers and failers in order to assign marks (say, A, B, C, D, 
and E). Discriminations must be made throughout the range of scores to 
rank students in order of ability, though better-than-average discrimination 
power should, theoretically, be exerted at the dividing points between the 
scores assigned A and B, B and C, etc. 

Suppose that we now consider a test of 10 items to be used with a 
group of 100 examinees. If the items are all uncorrelated with one another, 
the number of discriminations will be maximized when the items are all 
of 50 percent difficulty. If all items were perfectly correlated (and thus 
perfectly reliable), the number of discriminations made by 10 items at 50 
percent difficulty level would be identical with the number of discrimina- 
tions made by one item of 50 percent difficulty. The maximum number of 
discriminations that could be made if the 10 items were all at one level 
of difficulty and perfectly intercorrelated would be 2,500, but if we spread 
the 10 items at intervals of 9.09 percent from 9.09 percent to 90.90 per- 
cent, 4,562 discriminations could be made. The latter arrangement would 
be optimal for 10 items under the circumstances specified. 

These data suggest that maximum discrimination among all the members 
of a group may be obtained when the items in a test are uncorrelated by 
using items all of 50 percent difficulty, but that when the items are per- 
fectly correlated, the items should be spread over the range of difficulty. 
However, the items in a test are never found to be either wholly uncor- 
related or perfectly correlated. Hence, the limiting cases we have used for 
illustrations serve merely to guide our thinking about the distribution of 
item difficulties required to obtain maximum discrimination through- 
out the entire range of scores in a given sample. The analytic solution to 
this problem for any given level of item difficulty and item intercorrela- 
tion has been discussed by the writer, but the computational labor re- 
quired to provide a practical guide for the test constructor has not yet been 
performed (72). We can see intuitively, however, that since many kinds
of test items have low intercorrelations, a distribution of item difficulties 


clustered around the 50 percent level would often approximate the dis- 
tribution required to obtain maximum discrimination throughout the 
range of scores. For vocabulary items and other types that tend to have 
relatively high intercorrelations, the distribution of difficulty indices should 
be made more platykurtic than usual if equal accuracy of measurement and 
maximum discrimination are desired throughout the range of scores. 

It should be made clear at this point that there is often good reason for 
avoiding a distribution of item difficulties that will cause a test to yield 
equal accuracy of measurement throughout the range. The separation of a 
group of examinees into two subgroups calls for all possible accuracy of 
measurement at the dividing line; the assignment of marks (which calls 
for the division of a group into several parts) demands maximum accuracy
of measurement at the several dividing points scattered along the range of 
scores. Even when the members of a group are to be placed in rank order, 
the purpose for which the rank order is intended usually dictates a portion 
of the range that deserves greater accuracy of measurement than another. 
For example, in selecting teachers for a large school system, the qualifying 
examination should yield greater accuracy in the rank order of the applicants 
who obtain high scores since it is unlikely that applicants who obtain low 
scores will be given serious consideration unless there is a grave shortage 
of applicants for teaching positions. To select a half-dozen high school 
graduates to be given university scholarships calls for extraordinary accuracy 
of the rank order at the extreme upper end of the distribution of scores. A 
test designed for this purpose should be made up almost entirely of ex- 
ceptionally difficult items in each of the fields measured.*

Brogden has presented unusually interesting data regarding the effect 
on test validity of deliberate control of item difficulty and intercorrela- 
tion (69). These confirm the somewhat theoretical formulation presented 
above. For example, from Brogden's data we find that if a test of 45 items 
(all of 50 percent difficulty and having tetrachoric intercorrelations of .60)
has a correlation with a criterion variable of .950, we can increase this cor- 
relation to .961 by adding 108 similar items. But we can obtain a correla- 
tion of .962 by using only 45 items like the originals except that their diffi- 
culty indices form a rectangular instead of a point distribution. Thus, 
under these unusual conditions, a test of 45 items fairly well adjusted for 
difficulty level may be as useful as a test of 153 items all of 50 percent difficulty.
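
Brogden's figures come from his own tetrachoric computations, which are not reproduced here; but the diminishing return from merely lengthening a test can be sketched with the familiar formula for the validity of a sum of k parallel items, each correlating r_ic with the criterion and r̄ with every other item. The numerical values below are illustrative assumptions, not Brogden's data.

```python
from math import sqrt

def composite_validity(k, r_ic, r_bar):
    """Validity of the sum of k parallel items, each correlating
    r_ic with the criterion and r_bar with every other item."""
    return k * r_ic / sqrt(k + k * (k - 1) * r_bar)

# Illustrative values only: highly intercorrelated items of the
# kind Brogden studied, but not his actual parameters.
r_ic, r_bar = 0.74, 0.60
for k in (45, 90, 153):
    print(k, round(composite_validity(k, r_ic, r_bar), 3))
# the gain from more than tripling the length is only a few thousandths
```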

From the preceding discussion it is apparent that the distribution of 
difficulty indices should be controlled very closely if maximum efficiency 

" For a rigorous treatment of this issue see "Selected References," p. 327, item 85. 


in measurement is to be attained. When tiiis direct control is applied by 
means of difficulty indices, it is desirable to select those items of suitable 
difficulty that have high discrimination indices and that meet the approval 
of subject-matter specialists. For this purpose, discrimination indices that 
reflect the underlying relationships of the item and criterion variables are 
to be preferred to product-moment coefficients. 

Principles of Selecting Items for Specific Types of Tests

For purposes of this discussion all tests may be divided into two groups 
which we shall label "self -defining tests" and "predictor tests." The self- 
defining test is so named because the weighted sum of the skills and 
abilities measured by it actually defines what it measures. Achievement 
tests or conventional intelligence tests are common examples of self- 
defining tests. More unusual self-defining tests are Professor Thurstone's 
Tests of Primary Mental Abilities. All of these tests are constructed to 
measure a combination of skills and abilities specified in advance by the 
test constructor. A great deal of care and skill is exercised to make sure 
that the items included in a self-defining test actually measure what the 
test constructor wants the test to measure, but in the last analysis it is 
self-defining. It should be noted that self-defining tests are often used for 
prediction purposes, in which case they may be regarded as predictor tests. 

A predictor test is one that is constructed on the basis of empirical data 
to correlate as highly as possible with a criterion variable.* Once the items
for a predictor test have been tried out, selection of the items for the final 
form may, theoretically, be determined by multiple regression techniques 
or an approximation of them. The subjective judgment of the test con- 
structor in the selection of items is reduced to a minimum. 

Self-defining tests

The power test. — A power test is intended to measure level of perform- 
ance with respect to some defined skill or ability, or some specified com- 
bination of skills and abilities. The first step in constructing any self- 
defining test is to define the content to be measured in such a way that 
recognized authorities in the field will agree that the definition is adequate. 
This definition is best prepared in the form of an outline in which each 
topic is weighted in proportion to its importance, as judged by analyses of 
textbooks and curriculums and the pooled opinions of experts (see 
chapter 6).
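
The weighted outline translates directly into item quotas for the final form. A minimal sketch, with hypothetical topics and weights:

```python
def item_quotas(weights, n_items):
    """Apportion n_items among topics in proportion to their judged
    weights, giving leftover items to the largest remainders."""
    total = sum(weights.values())
    raw = {t: n_items * w / total for t, w in weights.items()}
    quotas = {t: int(r) for t, r in raw.items()}
    short = n_items - sum(quotas.values())
    for t in sorted(raw, key=lambda t: raw[t] - quotas[t], reverse=True)[:short]:
        quotas[t] += 1
    return quotas

# Hypothetical outline for an 80-item achievement test
weights = {"concepts": 5, "computation": 3, "applications": 2}
print(item_quotas(weights, 80))
# {'concepts': 40, 'computation': 24, 'applications': 16}
```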

"A suppressor test is a special kind of predictor test that is constructed to have a high 
correlation with an existing predictor and a low correlation with the criterion; see "Selected 
References," p. 328, item 104. 


If the items in a tryout test have been well constructed and are well 
apportioned among the skills and abilities to be measured, the total score 
may be regarded as a valid, if perhaps inefficient, measure of the intended 
subject-matter field. In other words, the test will be regarded as a valid 
measure by experts qualified to make such judgments. Given a tryout test 
of this kind, selection of items for the final form may be accomplished as
follows:

1. A difficulty index should be computed for each item. 

2. A discrimination index should be computed for each item, preferably 
an index that reflects the underlying item-criterion relationship. Since the 
total test score is used as the criterion, biserial r may be computed or one 
of the tables mentioned previously may be employed. 

3. Consideration should be given to possible differences in the average 
level of ability between the tryout group and various groups in which the 
final test is likely to be used. If difficulty indices are read from the Davis 
table, any difference that is presumed to exist may be expressed as a
constant number of points of difficulty. 

4. The number of items desired at each level of difficulty should be esti- 
mated. The shape of this distribution will depend on the purpose of the 
test and on the degree of homogeneity of the items. If the test is to be 
used with a "passing mark" and no further distinctions are to be made, all 
items should be of an appropriate difficulty level. If 30 percent of the 
applicants of ability corresponding to the predetermined average level of 
difficulty are to be rejected, the items should all be of a difficulty level 
corresponding to that represented by a difficulty index of 70 percent among 
the applicants. This procedure assumes that the underlying item-criterion 
relationships are equal. In practice, items above and below this level will 
have to be used since not enough items of precisely the desired level of 
difficulty are likely to be obtained that meet other requirements for selection. 

To obtain maximum discrimination among all the examinees, the shape 
of the distribution of items around the 50 percent level (among the group 
in which the test is to be used) should be made leptokurtic if the items are 
rather heterogeneous in content and platykurtic if the items are homo- 
geneous. This procedure will tend to maximize the correlation between ob- 
tained and true scores on the final form of the test; for self-defining tests, 
this correlation may be regarded as the validity coefficient of the test. The 
degree of homogeneity of well-edited items can be estimated by the 
average magnitude of the discrimination indices. An average biserial r of 
.60 to .70 (when the number of items in the criterion score is about 100) 
may be considered indicative of rather homogeneous items. The correspond- 


ing limits for Davis discrimination indices are about 40 to 55. An average 
biserial r of .20 to .30 (when the criterion score includes about 100 items) 
is characteristic of rather heterogeneous items. The corresponding limits 
for Davis discrimination indices are 12 to 19. 

If considerably greater accuracy of measurement is desired in one part of 
the range of difficulty than in another, items should be concentrated in 
that part of the range.

5. All the tryout items should next be separated into groups indicated 
in the outline of the test. From each separate group, a number of items 
should be selected tentatively that will be roughly proportional to the 
weight given each division in the test outline and that will create approxi- 
mately the proper distribution of item difficulty indices. At the same time, 
an effort should be made to choose those items having the highest discrimi-
nation indices. Needless to say, any item that has not been approved by 
subject-matter experts should be excluded from consideration. 

This is a complicated process; compromises have to be made within 
each division of the test outline, though sometimes compensating compro- 
mises can be arranged in other divisions. After long experience, the test 
constructor gets a "feel" for the process and learns how much compromise 
is either necessary or permissible under the particular circumstances. The 
test constructor cannot be a perfectionist when he is selecting items. 

6. The entire group of test items should be read over as a unit to detect 
unnoticed overlappings of choices and to prevent cross-keying of items; 
that is, having the statement or choices in one item give a clue to the answer to 
another item. 

7. The choice-by-choice item analysis data for each item should be 
studied and changes made in choices that are likely, in the test constructor's 
judgment, to improve the item's efficiency.* Ordinarily, revision of items in
this manner may be expected to lower the difficulty index (or increase the 
difficulty) of an item in which an unattractive incorrect choice is replaced 
with one judged to be more attractive to examinees who obtain low scores 
on the criterion variable and to raise the difficulty index (or decrease the 
difficulty) of an item in which nondiscriminating choices are replaced with 
ones judged to be less attractive to examinees who obtain high scores on 
the criterion variable. The test constructor should try to make allowances 
for the estimated net effect of these adjustments in item difficulty. 

8. The items should be grouped appropriately, arranged in order of 

" These changes should ordinarily be made even though a subsequent tryout will not be 
conducted to ascertain their efficiency. It would be wasteful to refrain from making the 
maximum use of all available data in revising items even though a certain risk may be 
involved. Whenever possible, of course, a second tryout of revised items should be made. 


difficulty, and have choices transposed to provide roughly an equal number 
of correct responses for each choice number or letter. 
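
The transposition of choices called for in step 8 can be sketched as a simple cycling of the key position; the item content itself is untouched, and a real test constructor would balance within the order-of-difficulty arrangement rather than cycle naively. The items below are dummies for illustration.

```python
from collections import Counter

def balance_key(items, n_choices=5):
    """Cycle the position of the correct answer so that each choice
    letter is keyed about equally often. Each item is given as a list
    of choices with the correct one first; returns (choices, key) pairs
    where key is the index of the correct choice after transposition."""
    balanced = []
    for i, choices in enumerate(items):
        key = i % n_choices
        rest = choices[1:]
        balanced.append((rest[:key] + [choices[0]] + rest[key:], key))
    return balanced

# Dummy items purely for illustration
items = [["right%d" % i] + ["wrong%d%d" % (i, j) for j in range(4)]
         for i in range(10)]
print(Counter(key for _, key in balance_key(items)))
# each of the five key positions is used twice
```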

It may seem surprising that no more emphasis is placed on the use of 
item discrimination indices than is indicated by the procedures described 
above. The writer believes, however, that their importance has often been 
greatly overemphasized in the construction of self-defining tests. Broadly 
speaking, they should be given more weight when the items in the tryout 
test prove to be extremely homogeneous. As the items making up the 
criterion score become less homogeneous, less weight should be given to 
the discrimination indices in selecting items. 

The speed test. — A speed test should, strictly, comprise items that are all 
so easy that, given time, all examinees could answer them correctly. Thus, 
scores derived from a speed test will reflect mainly the rapidity with which 
examinees answer the items. The selection of items for speed tests is some- 
times based partly on data provided by administering the items with suf- 
ficient time so that every examinee has a chance to try every item. Items of 
the desired level of difficulty are identified in this manner. 

The mastery test. — A mastery test is a special kind of test used when it is 
desired to determine the proportion of essential subject matter a pupil has 
learned. The content is determined by the judgment of experts, as in the 
case of ordinary tests. However, since the purpose is to measure only 
abilities or skills that should have been mastered by every pupil, the items 
are necessarily extremely easy and difficulty indices are not used to exclude 
items that are answered correctly by 100 percent, or almost 100 percent, of 
the tryout groups. Such items do not discriminate among the individuals 
tested and are ordinarily omitted from the final form of a test used to rank 
individuals rather than to ascertain their mastery of certain subject matter. 
Correlations between success or failure on each item and the criterion score 
are necessarily zero if everyone answers each correctly. Furthermore, these 
correlations are not very meaningful for items that are answered correctly 
by almost everyone. Hence, item analysis data are not especially serviceable 
for selecting items for mastery tests. 

Predictor tests 

The simple prediction test. — A simple prediction test is one used to 
provide an estimate of the rank order of a group of examinees with respect 
to a designated criterion variable. For example, a civil service examination in 
shorthand is intended to rank the applicants for positions as stenographers 
with respect to their expected facility in taking dictation on the job. 

To select the best combination of items for a simple selection test, we 


need data showing the correlation of each type of item with the criterion
and with the other types of items that have been tried out. The types of
items that have higher-than-average correlations with the criterion and 
lower-than-average correlations with one another represent the most prom- 
ising sources of items for the final form of the test. Consequently, we 
select from the proper groups those individual items that are of the desired 
level of difficulty, that are judged satisfactory by subject-matter experts, and 
that have the highest correlations with the criterion. If item analysis data 
are available with the total score on all the items, some preference should 
be given to items that have the lowest internal-consistency discrimination 
indices. To check on the validity of the resulting final form, an entirely 
new sample of examinees must be tested. The importance of checking 
validity coefficients on a sample different from that used in constructing 
the test cannot be overemphasized. 

In actual practice, the selection of items for a simple prediction test may 
be accomplished efficiently by obtaining two sets of item discrimination 
indices — one with the total score of the test in which the item has been 
tried employed as the criterion variable and one with the variable to be 
predicted employed as the criterion. If the test is to be used with groups 
in which the distribution and average level of ability are essentially the 
same as in the tryout samples, the item discrimination indices may be ex- 
pressed as product-moment correlation coefficients and items should be 
chosen to minimize the average of the internal-consistency coefficients and 
maximize the average of the coefficients with the variable to be predicted. 
The procedures described by Horst (93), Flanagan (88), or by Richard- 
son and Adkins (96) are reasonably effective methods of accomplishing 
this. Gulliksen has reported data concerning the validity of a test con- 
structed in accordance with Horst's technique (89). 
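
The two sets of discrimination indices can be combined in the crude fashion sketched below: correlate each item with the external criterion and with the total score, and prefer items for which the first coefficient is high and the second low. This illustrates the selection rule only, not Horst's or Flanagan's actual procedures, and the tryout data are invented.

```python
def pearson(x, y):
    """Product-moment correlation of two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sxy / (sx * sy)

def rank_items(item_scores, total, criterion):
    """Order items by item-criterion r minus item-total r, so that
    high-validity, low-internal-consistency items come first."""
    keyed = [(pearson(col, criterion) - pearson(col, total), j)
             for j, col in enumerate(item_scores)]
    return [j for _, j in sorted(keyed, reverse=True)]

# Invented tryout data: six examinees, three scored items (1 = right)
item_scores = [
    [1, 1, 1, 0, 0, 0],   # item 0
    [1, 0, 1, 0, 1, 0],   # item 1 tracks the external criterion
    [1, 1, 0, 1, 0, 0],   # item 2
]
total = [sum(col[i] for col in item_scores) for i in range(6)]
criterion = [1, 0, 1, 0, 1, 0]
print(rank_items(item_scores, total, criterion))  # [1, 0, 2]
```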

If what is desired is minimum over-all error of prediction in the least- 
squares sense, the difficulty indices of items chosen in accordance with 
Horst's technique need not be considered at all under the circumstances 
specified. Often, however, the group with which the test is to be used is 
not of the same average level of ability as the group in which tryout data 
were obtained. Furthermore, the test constructor is not always interested in 
minimizing over-all errors in prediction; he may wish to predict scores 
within a certain part of the range, or at a particular point, with great 
accuracy even if this means he must accept rather unreliable predictions in 
other ranges of scores. To accomplish any of these objectives, the distribu- 
tion of item difficulty indices must be deliberately controlled; items must 
be selected for the final form in such a way as to obtain the desired distribu- 


tion of item difficulty indices, minimize item-test relationships, and maxi- 
mize item-criterion relationships. These relationships should be expressed 
as biserial r's, tetrachoric r's, Flanagan r's, or Davis discrimination indices.

Whenever item analysis data are used to select items with minimum 
internal consistency, great care must be exercised to make sure that the 
low internal-consistency discrimination indices found for some items are 
the result of low relationships between the underlying variables measured 
by the items and by the total score used as a criterion and not the result of 
some ambiguity or other fault in the items themselves. To make sure of 
this, the criticisms of subject-matter experts should be consulted, and care- 
ful scrutiny of the items should be made. 

The effectiveness of selecting items by means of the procedure recom- 
mended for simple prediction tests depends on the following factors: first, 
the reliability of the discrimination indices; and second, the degree of cor- 
relation between the total test scores used as a criterion for the internal- 
consistency analysis and the variable to be predicted. If the indices are not 
highly reliable, selection of items having low internal-consistency discrimi- 
nation indices and high validity indices becomes largely a matter of chance 
and cannot be justified. The use of large samples of examinees is imperative 
when indices are to be used for this purpose. Samples of 1,000 examinees 
are unquestionably minimal. 
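
The insistence on large samples follows from the sampling error of a correlation coefficient, which in Fisher's z units is approximately 1/√(N − 3). A quick check of the orders of magnitude involved:

```python
from math import sqrt

def se_fisher_z(n):
    """Approximate standard error of a correlation coefficient
    expressed in Fisher z units, for a sample of n examinees."""
    return 1 / sqrt(n - 3)

for n in (100, 400, 1000):
    print(n, round(se_fisher_z(n), 3))
# an index from 100 cases is uncertain by about .10 in z units;
# even 1,000 cases leave a standard error near .03
```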

As the correlation between the total test score used as the criterion for 
the internal-consistency analysis and the variable to be predicted increases, 
it becomes less important to use internal-consistency data as a supplement 
to item-criterion data. The lower this correlation, the greater the gain in 
test validity that becomes possible by using two sets rather than one set of 
item analysis data. 

Some psychologists have expressed concern that the procedure for 
maximizing test validity recommended in this section will lead to an unde- 
sirable degree of heterogeneity in tests. There is no doubt that the procedure 
tends to increase item heterogeneity, but it also tends to maximize test 
validity. Since the latter is usually the proper standard by which the merit 
of a simple prediction test is to be judged, it seems hardly justifiable to 
condemn the heterogeneity of the items that leads to high validity. In 
actual practice, items must be selected from the tryout form of the test, 
which includes items judged to be appropriate on the basis of the test 
outline. Therefore, the advantageous heterogeneity produced by selecting 
items with low intercorrelations is held within close limits. 

The simple selection test. — A test used for simple selection is only a
special type of simple prediction test. Instead of ranking individuals in 


terms of expected performance with respect to a criterion variable, a simple 
selection test is designed to separate examinees into two groups — those to 
be accepted and those to be rejected. No further distinctions are to be made 
within the two groups. The methods used for selecting items are those 
appropriate in the case of simple prediction tests except that items are 
chosen so that their difficulty indices will be as close as possible to the level
of difficulty represented by the line of demarcation between the two groups 
to be formed. In other words, every item should be as nearly as possible of 
50 percent difficulty for examinees at the level of ability represented by the
"passing mark." 

The multiple-prediction test. — A multiple-prediction test is used to 
obtain separate rank orders of performance for a group of examinees with 
respect to more than one criterion variable. Unless the various criteria to be 
predicted are highly intercorrelated, the multiple-prediction test must be 
composed of separately scored subtests. These subtests are correlated with 
each other and with each one of the variables to be predicted, and the effec- 
tive weights capable of minimizing the over-all error of prediction for each 
criterion variable are assigned to the subtests. 

For each subtest, items should be selected that measure the same mental 
function. Each subtest should approximate a "pure" test, so internal- 
consistency item analysis data should be used to select items for each subtest 
that have high correlations with the total score on that subtest and, if 
possible, low correlations with the total scores on other subtests. A descrip- 
tion of a test constructed in this manner was presented by Davis in
1947 (43; pp. 92-95). Theoretically, there is no reason why "pure" tests 
must have zero or low intercorrelations, as L. L. Thurstone has so well 
pointed out on many occasions. Nonetheless, there is some practical 
efficiency achieved by using item analysis data to minimize the intercorrela- 
tions of subtests employed to obtain weighted composite scores. 

The shape of the distribution of difficulty indices that is optimal for 
each subtest depends on the degree of intercorrelation of its component 
items, on the degree of intercorrelation of the subtest scores included in the 
composite, on the statistic used to express the indices, and on the purpose 
for which the composite scores are to be used. A detailed exposition of 
this matter is too long to be included in this chapter, but it can be said that, 
in general, the accuracy of measurement of scores in each subtest should be 
equal over a rather wide range of ability because subtests are generally 
selected partly on the basis of low intercorrelations; hence, a given indi- 
vidual's weighted composite score is apt to depend on subtest scores repre- 
senting rather different levels of ability. 


To obtain subtest scores having equal accuracy of measurement over a 
rather wide range, the principles recommended for this purpose in the 
preceding section on simple prediction tests should be followed. Since the 
items in a subtest are selected largely on the basis of homogeneity of 
content, their difficulty indices should cover a wide range rather than 
cluster close to the 50 percent level of difficulty. The use of only items of 
50 percent difficulty is more inappropriate than usual in the case of a so- 
called "pure" test unless the test is used with a "passing mark" at the 
level of ability represented by a difficulty index of 50. 

The multiple-selection test. — Instead of providing separate rank orders
for several criterion variables, the multiple-selection test is designed to 
separate the examinees into only two groups with respect to each criterion. 
It is analogous to the simple selection test; therefore, accuracy of measure- 
ment of the weighted composite scores derived from its subtests should be 
maximized at the several lines of demarcation. To accomplish this as effec- 
tively as in the case of a simple selection test is impossible. Reduction in 
the intercorrelations of the subtests lowers the extent to which accuracy of 
measurement in the composite scores can be concentrated at the line of 
demarcation. In practice, one can often do little better than utilize the 
principles of selecting items recommended for use with multiple-prediction
tests.

The differential classification test. — A differential classification test is 
used to determine (at stated levels of confidence) in which one of several 
criterion variables any given person will show the highest level of com- 
petence. For example, if a man must be assigned to one of four available 
jobs, we use a differential classification test to determine in which one of 
the jobs he is likely to be most successful. Our objective in differential 
classification, then, is not to determine how well he is likely to be able to 
perform any skill that the four jobs have in common, but rather to deter- 
mine how well he is likely to perform the skills that are unique to each job. 
This means that an item should be assigned to one of the subtests in the 
differential classification test on the basis of differences between its cor- 
relations with the several criteria to be predicted and without regard for 
internal-consistency data. The larger the difference between the correlation 
of an item with one criterion and the median of its correlations with other 
criteria, the better the item. (The comparison of differences between cor- 
relation coefficients should be made only after their transformation into 
Fisher's z or Davis' discrimination index.) An excellent item for purposes 
of differential classification would be one that displayed a correlation of 
.40 with one job criterion and correlations of .11, .02, and — .07 with the 


other job criteria. There is no mathematical necessity that the subtests 
constructed in this fashion be "pure" tests, but the method used will tend 
to lower their intercorrelations. 
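
The comparison rule can be put concretely: transform each item-criterion correlation to Fisher's z and score the item by the gap between its z for the target criterion and its median z for the remaining criteria. Applied to the .40, .11, .02, and −.07 item cited above:

```python
from math import atanh
from statistics import median

def differential_index(r_target, r_others):
    """Fisher-z gap between an item's correlation with the target
    criterion and its median correlation with the other criteria."""
    return atanh(r_target) - median(atanh(r) for r in r_others)

print(round(differential_index(0.40, [0.11, 0.02, -0.07]), 3))  # 0.404
```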

It is interesting to note that the common variance of the criterion 
variables is irrelevant for differential classification; hence, if factorial 
studies are performed on matrices made up entirely of them, it is best to 
analyze either the total variance or the non-chance variance. If com- 
munalities are used, the variance to be predicted for purposes of differential 
classification is likely to have been excluded in advance from the analysis. 

In practice, genuine differential classification tests are rarely sought 
because the scores resulting from their use should predict to a large degree 
only unique elements of each criterion variable. If a man were assigned to 
a certain job because his subtest scores on a differential classification test 
showed that he was more likely to succeed in it than in any other of the 
criterion jobs, he might still not perform satisfactorily because his absolute 
level of ability in the combined common and unique variance of the job 
might be too low. The differential classification test scores would have 
indicated only his relative standing in the criterion jobs; they would have 
indicated very little about his actual level of performance. 

This means that differential classification tests are appropriate when 
every examinee must be assigned to work in one of the criterion jobs or 
when all of the examinees have previously been selected by means of some 
sort of qualifying examination, perhaps one that measures largely the com- 
mon variance of the criterion jobs. If some testees are to be rejected for 
all of the jobs or if certain absolute levels of performance are set as 
minimal for acceptance in one or more of the criterion jobs, a multiple- 
selection test is used for the purpose. A differential classification test is 
likely to be of most social utility in time of national mobilization. Even in 
that situation, however, not everyone called up for national service is likely 
to be accepted. 

Recording Item Data 

The selection of items for the final form of a test can be greatly facili- 
tated if the data that must be used are available in convenient form. Be- 
cause the selection process usually involves combining items in several 
different ways before a satisfactory arrangement is found, it is most con- 
venient to have each item, together with all data pertaining to it, on a 
single filing card. Specially printed 5 x 8 cards are quite suitable for this
purpose; they are large enough to carry all the data that are ordinarily 
required and small enough to be manipulated easily. Figure 5 shows one
side of such a card with an item and two sets of item analysis data pertaining 
to it already entered on the card. The item was cut out of the tryout booklet 
(which had been photo-offset) and was affixed with rubber cement to the 
card in the space provided. This procedure saved the labor of retyping and 
proofreading each item. If the number of lines in an item is so large that 
it will not fit in the space provided, the top part can be pasted and the 
bottom part folded over. 

The choice-by-choice data derived from an internal-consistency item 
analysis are shown in the two columns at the extreme left of the card. 
Similar data derived from an item analysis, using an external criterion 
(graduation or elimination from elementary pilot training in the Army 
Air Forces), are shown in the two right-hand columns. The difficulty index
in each of the two samples is shown as the corrected percent of examinees 
who marked the right answer. There was no one who failed to reach the 
item in the time limit, though a small percent of each group omitted the 
item after reading it. The discrimination index derived from the internal- 
consistency analysis is expressed as a correlation coefficient read from the 
Flanagan table; the one derived from the external-criterion analysis is 
expressed as a tetrachoric correlation coefficient read from computing
diagrams.

The reverse side of the same item card is shown in Figure 5 continued. 
Space is provided for additional item analysis data and for comments. 
A revision of the item has been suggested by one of the critics to whom it 
was submitted for scrutiny. The wording has been improved and choice E 
has been replaced with one that appears more attractive to examinees, 
especially to those likely to fail in pilot training. 

If sufficient clerical help is available, it is desirable to make two cards for 
each item that is tried out. The first copy is filed by subject matter; the 
second is used to construct the final form of the test and is filed numerically 
by test form if it is included in a final form. If not, it is filed by subject 
matter in a "discarded item" file that can be used as a source of ideas for 
new items. In any large-scale test construction agency an item file of the 
kind described will not be found more elaborate than is desirable. For 
teachers and individuals who construct occasional tests, a single file of item 
cards may be adequate. 
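The dual filing scheme just described can be sketched as a simple data structure. The record fields and file names below are illustrative assumptions, not taken from the original card layout: one copy of each card is filed by subject matter, the duplicate by final test form, and unused items go to a "discarded item" file by subject.

```python
# A minimal sketch of the dual card file described above; field names
# are illustrative, not from the original cards.
from dataclasses import dataclass
from typing import Optional
from collections import defaultdict

@dataclass
class ItemCard:
    item_id: int
    subject: str
    test_form: Optional[str]  # None if the item was not used in a final form

subject_file = defaultdict(list)    # first copy: filed by subject matter
form_file = defaultdict(list)       # second copy: filed by final test form
discarded_file = defaultdict(list)  # unused items, kept as a source of ideas

def file_item(card):
    subject_file[card.subject].append(card)
    if card.test_form is not None:
        form_file[card.test_form].append(card)
    else:
        discarded_file[card.subject].append(card)

file_item(ItemCard(1, "vocabulary", "Form A"))
file_item(ItemCard(2, "vocabulary", None))
print(len(form_file["Form A"]), len(discarded_file["vocabulary"]))  # -> 1 1
```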

Selected References 

Annotated lists of references pertaining to item analysis data and techniques 
of item selection may be found in the compilations by Swineford and Holzinger 
published in the June issue of The School Review for several years past. Special 
issues of the Review of Educational Research, the first and second issues of the 


Yearbook of Research and Statistical Methodology edited by Buros, and the
annual bibliographies published by Good in the Journal of Educational Research 
may also be consulted for classified lists of references concerning test construction. 
In view of the bibliographical material already available, no effort has been 
made to present a complete list of references here. Only articles of special interest 
in connection with this chapter are listed. 

Techniques of Item Analysis 

1. Adkins, D. C., et al. Construction and Analysis of Achievement Tests: The Develop-
ment of Written and Performance Tests of Achievement for Predicting Job Perform-
ance of Public Personnel. Washington: Government Printing Office, 1947.

2. Arnold, J. N. "Nomogram for Determining Validity of Test Items," Journal of
Educational Psychology, 26: 151-53, 1935.

3. Baker, K. H. "Item Validity by the Analysis of Variance; An Outline of Method," 
Psychological Record, 3: 242-48, 1939. 

4. Barthelmess, H. M. The Validity of Intelligence Test Elements. ("Teachers College 
Contributions to Education," No. 505.) New York: Teachers College, Columbia 
University, 1931. 

5. Chapanis, A. "Notes on the Rapid Calculation of Item Validities," Journal of Educa- 
tional Psychology, 32: 297-304, 1941.

6. Clark, E. L. "A Method of Evaluating the Units of a Test," Journal of Educational 
Psychology, 19: 263-65, 1928. 

7. Davis, F. B. Item-Analysis Data: Their Computation, Interpretation, and Use in Test 
Construction. ("Harvard Education Papers," No. 2.) Cambridge: Graduate School of 
Education, Harvard University, 1946. 

8. DuBois, P. H. "A Note on the Computation of Biserial r in Item Validation," 
Psychometrika, 7: 143-46, 1942.

9. Dunlap, J. W. "Nomograph for Computing Bi-Serial Correlations," Psychometrika,
1: 59-60, 1936. 

10. ———. "Note on Computation of Biserial Correlations in Item Evaluation," Psycho-
metrika, 1: 51-58, 1936.

11. Ferguson, G. A. "Item Selection by the Constant Process," Psychometrika, 7: 19-29,
1942.
12. Flanagan, J. C. "General Considerations in the Selection of Test Items and a Short 
Method of Estimating the Product-Moment Coefficient from the Data at the Tails 
of the Distributions," Journal of Educational Psychology, 30: 674-80, 1939. 

13. Fulcher, J. S., and Zubin, J. "The Item Analyzer: A Mechanical Device for Treating
the Four-fold Table in Large Samples," Journal of Applied Psychology, 26: 511-22,
1942.
14. Garlough, L. N. "A Convenient Method for Calculating Indices of Ease and of 
Differentiating Ability for Individual Test Questions," Journal of Educational Re- 
search, 35: 611-17, 1942. 

15. Goheen, H. W., and Kavruck, S. "A Worksheet for Tetrachoric r and Standard
Error of Tetrachoric r Using Hayes' Diagrams and Tables," Psychometrika, 13: 279-80,
1948.
16. Guilford, J. P. "The Phi Coefficient and Chi Square as Indices of Item Validity," 
Psychometrika, 6: 11-19, 1941. 

17. Guttman, L. "The Cornell Technique for Scale and Intensity Analysis," Educational 
and Psychological Measurement, 7: 247-80, 1947. 

18. Johnson, A. P. "An Index of Item Validity Providing a Correction for Chance 
Success," Psychometrika, 12: 51-58, 1947. 

19. Kelley, T. L. "The Selection of Upper and Lower Groups for the Validation of Test 
Items," Journal of Educational Psychology, 30: 17-24, 1939. 

20. Kolbe, L. E., and Edgerton, H. A. "A Table for Computing Biserial r," Journal of 
Experimental Education, 4: 245-51, 1936.


21. Kuder, G. F. "Nomograph for Point Biserial r, Biserial r, and Four-fold Correlations,"
Psychometrika, 2: 135-38, 1937. 

22. Lawshe, C. H., Jr. "A Nomograph for Estimating the Validity of Test Items," 
Journal of Applied Psychology, 26: 846-49, 1942. 

23. Lentz, T. F., and Whitmer, E. F. "Item Synonymization: A Method for Determining 
the Total Meaning of Pencil-Paper Reactions," Psychometrika, 6: 131-39, 1941. 

24. Lev, J. "Evaluation of Test Items by the Method of Analysis of Variance," Journal 
of Educational Psychology, 29: 623-30, 1938. 

25. Lindquist, E. F., and Cook, W. W. "Experimental Procedures in Test Evaluation,"
Journal of Experimental Education, 1: 163-85, 1933. 

26. Loevinger, J. A Systematic Approach to the Construction and Evaluation of Tests 
of Ability. ("Psychological Monographs," No. 285.) Washington: American Psycho- 
logical Association, 1947. 

27. Long, J. A., and Sandiford, P., et al. The Validation of Test Items. Toronto: 
Department of Educational Research, University of Toronto, 1935. 

28. Mosier, C. I., and McQuitty, J. V. "Methods of Item Validation and Abacs for
Item-Test Correlation and Critical Ratio of Upper-Lower Difference," Psychometrika, 
5: 57-65, 1940. 

29. Richardson, M. W. "Notes on the Rationale of Item Analysis," Psychometrika, 1:
69-76, 1936. 

30. Richardson, M. W., and Stalnaker, J. H. "A Note on the Use of Biserial r in Test 
Research," Journal of General Psychology, 8: 463-65, 1933. 

31. Royer, E. B. "A Machine Method for Computing the Biserial Correlation Coefficient
in Item Validation," Psychometrika, 6: 55-59, 1941.

32. Turnbull, W. W. "A Normalized Graphic Method of Item Analysis," Journal of
Educational Psychology, 37: 129-41, 1946. 

33. Vernon, P. E. "Indices of Item Consistency and Validity," British Journal of Psy- 
chology, Statistical Section, 1: 152-66, 1948. 

34. Votaw, D. F. "Graphical Determination of Probable Error in Validation of Test 
Items," Journal of Educational Psychology, 25: 682-86, 1933. 

35. ———. "Notes on the Validation of Test Items by Comparison of Widely Spaced
Groups," Journal of Educational Psychology, 25: 185-91, 1934.

36. Walker, H. M. "Item Selection by Sequential Sampling," Teachers College Record, 
50: 404-9, 1949. 

37. Walker, H. M., and Cohen, S. Probability Tables for Item Analysis by Means of 
Sequential Sampling. New York: Bureau of Publications, Teachers College, Columbia 
University, 1949. 

38. Zubin, J. "The Method of Internal Consistency for Selecting Test Items," Journal of
Educational Psychology, 25: 345-56, 1934.

Studies Related to Item Analysis Techniques 

39. Barry, R. F. "An Analysis of Some New Statistical Methods for Selecting Test Items," 
Journal of Experimental Education, 7: 221-28, 1939. 

40. Brogden, H. E. "On the Interpretation of the Correlation Coefficient as a Measure 
of Predictive Efficiency," Journal of Educational Psychology, 37: 65-76, 1946. 

41. Carter, H. D. "How Reliable Are the Common Measures of Difficulty and Validity 
of Objective Test Items?" Journal of Psychology, 13: 31-39, 1942. 

42. Davis, F. B. "Notes on Test Construction: The Reliability of Item-Analysis Data," 
Journal of Educational Psychology, 37: 385-90, 1946.

43. ——— (ed.). The AAF Qualifying Examination. ("Aviation Psychology Program
Research Reports," No. 6.) Washington: Government Printing Office, 1947. Ap-
pendix A.

44. DopPELT, J. E., and Potts, E. M. "The Constancy of Item-Test Correlation Coeffi- 
cients Computed from Upper and Lower Groups," American Psychologist, 3: 261,
1948. (Abstract.) 

45. Flanagan, J. C. Factor Analysis in the Study of Personality. Stanford, Calif.: 
Stanford University Press, 1935. 


46. Forlano, G., and Pintner, R. "Selection of Upper and Lower Groups for Item
Validation," Journal of Educational Psychology, 32: 544-49, 1941. 

47. Gibbons, C. C. "The Predictive Value of the Most Valid Items of an Examination," 
Journal of Educational Psychology, 31: 616-21, 1940. 

48. Gillman, L., and Goode, H. H. "An Estimate of the Correlation Coefficient of a
Bivariate Normal Population When X Is Truncated and Y Is Dichotomized," Harvard
Educational Review, 16: 52-55, 1946.

49. Guilford, J. P., and Lacey, J. I. (eds.). Printed Classification Tests. ("Aviation 
Psychology Program Research Reports," No. 5.) Washington: Government Printing 
Office, 1947.

50. Handy, U., and Lentz, T. F. "Item Value and Test Reliability," Journal of Educa- 
tional Psychology, 25: 703-8, 1934. 

51. Hayes, S. P., Jr. "Diagrams for Computing Tetrachoric Correlation Coefficients from 
Percentage Differences," Psychometrika, 11: 163-72, 1946. 

52. Horst, A. P. "Regression Weights as a Function of Test Length," Psychometrika,
13: 125-34, 1948. 

53. Jurgensen, C. E. "Table for Determining Phi Coefficients," Psychometrika, 12:
17-29, 1947. 

54. Katzell, R. A., and Cureton, E. E. "Biserial Correlation and Prediction," Journal of
Psychology, 24: 273-78, 1947. 

55. Kroll, A. "Item Validity as a Factor in Test Validity," Journal of Educational
Psychology, 31: 425-36, 1940. 

56. Lawshe, C. H., Jr., and Thayer, J. S. "Studies in Item Analysis: I. The Effect of
Two Methods of Item Validation on Test Reliability," Journal of Applied Psychology,
31: 271-77, 1947.

57. Lentz, T. F.; Hirshstein, B.; and Finch, F. H. "Evaluation of Methods of Evaluat-
ing Test Items," Journal of Educational Psychology, 23: 344-50, 1932. 

58. Lord, F. M. "Reliability of Multiple-Choice Tests as a Function of Number of Choices 
Per Item," Journal of Educational Psychology, 35: 175-80, 1944. 

59. McNamara, W. J., and Weitzman, E. "The Economy of Item Analysis with the 
IBM Graphic Item Counter," Journal of Applied Psychology, 30: 84-90, 1946. 

60. Merrill, W. W., Jr. "Sampling Theory in Item Analysis," Psychometrika, 2: 215- 
23, 1937. 

61. Mollenkopf, W. G. "An Experimental Study of the Effects on Item-Analysis Data
of Changing Item Placement and Test Time Limit," Psychometrika, 15: 291-315,
1950.
62. Mosier, C. I. "A Note on Item Analysis and the Criterion of Internal Consistency," 
Psychometrika, 1: 275-82, 1936. 

63. Pintner, R., and Forlano, G. "A Comparison of Methods of Item Selection for a
Personality Test," Journal of Applied Psychology, 21: 643-52, 1937. 

64. Swineford, F. "Validity of Test Items," Journal of Educational Psychology, 27:
68-78, 1936. 

65. ———. "Biserial r Versus Pearson r as Measures of Test-Item Validity," Journal of
Educational Psychology, 27: 471-72, 1936.

66. Wesman, A. G. "Effect of Speed on Item-Test Correlation Coefficients," Educational
and Psychological Measurement, 9: 51-57, 1949. 

67. Wherry, R. J., and Gaylord, R. H. "The Concept of Test and Item Reliability in 
Relation to Factor Pattern," Psychometrika, 8: 247-64, 1943.

68. Zubin, J. "Note on a Transformation Function for Proportions and Percentages,"
Journal of Applied Psychology, 19: 213-20, 1935. 

Item Difficulty 

69. Brogden, H. E. "Variation in Test Validity with Variation in the Distribution of 
Item Difficulties, Number of Items, and Degree of Their Intercorrelation," Psycho- 
metrika, 11: 197-214, 1946. 


70. Carroll, J. B. "The Effect of Difficulty and Chance Success on Correlations Between
Items or Between Tests," Psychometrika, 10: 1-19, 1945.

71. Chen, L. "The Correction Formula for Matching Tests," Journal of Educational
Psychology, 35: 565-66, 1944.

72. Davis, F. B. "The Selection of Test Items According to Difficulty," American Psy-
chologist, 4: 243, 1949. (Abstract.) 

73. Ferguson, G. A. "On the Theory of Test Discrimination," Psychometrika, 14: 61-68,
1949.
74. Guilford, J. P. "The Determination of Item Difficulty When Chance Success Is a 
Factor," Psychometrika, 1: 259-64, 1936. 

75. Gulliksen, H. O. "The Relation of Item Difficulty and Inter-Item Correlation to
Test Variance and Reliability," Psychometrika, 10: 79-92, 1945.

76. Harris, C. W. "Prediction of the Difficulty Index of Objective-Type Spelling Items,"
Educational and Psychological Measurement, 7: 319-25, 1947.

77. Horst, A. P. "The Chance Element in the Multiple-Choice Test Item," Journal of
General Psychology, 6: 209-11, 1932.

78. ———. "The Difficulty of a Multiple-Choice Test Item," Journal of Educational
Psychology, 24: 229-32, 1933.

79. Jackson, R. W. B., and Ferguson, G. A. "A Plea for a Functional Approach to 
Test Construction," Educational and Psychological Measurement, 3: 23-28, 1943.

80. Lawley, D. N. "On Problems Connected with Item Selection and Test Construction,"
Proceedings of the Royal Society of Edinburgh, 61 (Section A, Part III): 273-87,
1943.
81. Odell, C. W. "The Scoring of Continuity or Rearrangement Tests," Journal of 
Educational Psychology, 35: 352-56, 1944. 

82. Richardson, M. W. "The Relation Between the Difficulty and the Differential Valid- 
ity of a Test," Psychometrika, 1: 33-49, 1936. 

83. Symonds, P. M. "Factors Influencing Test Reliability," Journal of Educational Psy-
chology, 19: 73-87, 1928. 

84. Thurstone, T. G. "The Difficulty of a Test and Its Diagnostic Value," Journal of 
Educational Psychology, 23: 335-43, 1932. 

85. Walker, D. A. "Answer-Pattern and Score-Scatter in Tests and Examinations," Brit- 
ish Journal of Psychology, 22: 73-86, 1931; 26: 301-8, 1936; 30: 248-60, 1940. 

The Combination of Test Items 

86. Adkins, D. C., and Toops, H. A. "Simplified Formulas for Item Selection and Con-
struction," Psychometrika, 2: 165-71, 1937. 

87. Brogden, H. E. "An Approach to the Problem of Differential Prediction," Psycho- 
metrika, 11: 139-54, 1946. 

88. Flanagan, J. C. "A Short Method for Selecting the Best Combination of Test Items 
for a Particular Purpose," Psychological Bulletin, 33: 603-4, 1936. (Abstract.) 

89. Gulliksen, H. O. Selection of Test Items by Correlation With an External Criterion, 
as Applied to a Mechanical Comprehension Test. (Office of Scientific Research and 
Development, Report No. 13319.) Washington: Department of Commerce, 1946.

90. Horst, A. P. "Determination of Optimal Test Length to Maximize the Multiple Cor-
relation," Psychometrika, 14: 79-88, 1949. 

91. ———. "The Economical Collection of Data for Test Validation," Journal of Ex-
perimental Education, 2: 250-53, 1934. 

92. ———. "Item Analysis by the Method of Successive Residuals," Journal of Experi-
mental Education, 2: 254-63, 1934. 

93. ———. "Item Selection by Means of a Maximizing Function," Psychometrika, 1:
229-44, 1936.

94. Long, W. F., and Burr, I. W. "Development of a Method of Increasing the Utility 
of Multiple Correlations by Considering Both Testing Time and Test Validity," 
Psychometrika, 14: 137-61, 1949.


95. Owens, W. A. "An Empirical Study of the Relationship Between Item Validity and 
Internal Consistency," Educational and Psychological Measurement, 7: 281-88, 1947. 

96. Richardson, M. W., and Adkins, D. C. "A Rapid Method of Selecting Test Items," 
Journal of Educational Psychology, 29: 547-52, 1938.

97. Toops, H. A. "The L-Method," Psychometrika, 6: 249-66, 1941. 

Statistical Tables and References 

98. Chesire, L.; Saffir, M.; and Thurstone, L. L. Computing Diagrams for the 
Tetrachoric Correlation Coefficient. Chicago: University of Chicago Bookstore, 1933. 

99. Fisher, R. A. Statistical Methods for Research Workers. London: Oliver & Boyd, 1938. 
100. ———, and Yates, F. Statistical Tables for Biological, Agricultural, and Medical
Research. London: Oliver & Boyd, 1938.
101. Kelley, T. L. Fundamentals of Statistics. Cambridge: Harvard University Press, 1947.

102. ———. The Kelley Statistical Tables. Cambridge: Harvard University Press, 1948.

103. ———. Statistical Method. New York: Macmillan Co., 1924.

104. McNemar, Q. Psychological Statistics. New York: Wiley, 1949. 

105. Pearson, K. (ed.) Tables for Statisticians and Biometricians, Part I. London: Bio- 
metric Laboratory, University College, 1914. 

106. ——— (ed.). Tables for Statisticians and Biometricians, Part II. London: Biometric
Laboratory, University College, 1931.

10. Administering and Scoring the 
Objective Test 

By Arthur E. Traxler 

Educational Records Bureau

Collaborators: Edward E. Cureton, University of Tennessee; Paul L. 
Dressel, Michigan State College; Herbert A. Toops, Ohio State University

The realization of the potential values of educational 
measurement depends largely upon the understanding, accuracy, and com- 
petence with which tests are administered and used by the multitude of 
nonspecialists in measurement responsible for the teaching and guidance 
of the pupils in our schools. If test specialists are to avoid having their 
efforts to construct precise, well-standardized tests negated, they must take 
every precaution to safeguard the application and use of the tests. If school 
administrators are to find the results of tests worth the time and expense, 
great care must be taken to insure that the accuracy of the scores is not 
vitiated through misunderstanding or carelessness in the administration and 
scoring of the tests. 

The Importance of the "Mechanical" Procedures 

Valid Results Depend upon Accurate 
Administration and Scoring 

In view of their crucial importance in the whole chain of events from 
the conception of the test to the use of the scores in conferences with indi- 
viduals, it seems highly unfortunate that the giving and scoring of tests 
are frequently treated very casually by both the authors and the users of 
tests. Test specialists have been dilatory in providing research data on the 
many debatable points relative to test administration and scoring, and, in 
general, test makers have not applied the same care and zeal to the writing 
of directions for administering tests that they have applied to item valida- 
tion and other technical aspects of test construction. The reasons for special 
points in directions are seldom given, and needful precautions are seldom 
stated. Even when the directions have been carefully formulated, too often 
they have not been observed faithfully by test users. 



The point should be stressed that tests are standardized on the basis of 
a particular set of directions for administering and scoring. Comparisons 
with norms are valid only when exactly the same procedure is used in 
administering and scoring the tests locally that was employed when the 
norms were established. It is likewise true that full comparability of scores 
from school to school within a system or a common testing program can 
be achieved only if all schools give the tests in the same way and have 
a maximum degree of scoring accuracy, or at least an equivalent bias in 
scoring errors. More important, still, valid comparisons from test to test — 
that is, the diagnostic values of tests for the individual — depend upon 
careful administration and accurate scoring of all the tests. 

Inadequate Administration and Scoring 
May Vitiate Test Results

There are numerous ways in which administration and scoring errors may 
impair test scores. Some types of improper administration or scoring cause 
bias in the results for entire groups and render intergroup comparisons of 
little value. Other kinds affect only certain individuals in varying degrees 
and thus lower the validity of the test results for guidance, counseling, and 
self-insight. Some of the more important inadequacies and errors in the giving
and scoring of tests will be mentioned briefly. Most of them will be
treated in greater detail later in the chapter. 

Administration. — Probably the greatest single source of error in test 
administration is incorrect timing of tests that involve a time limit. Failure 
to read the timing directions correctly, lack of understanding on the part of 
inexperienced examiners of the need for precise timing, carelessness, faulty 
timepieces, and unintelligent scheduling so that it is impossible to allow 
enough time for a given test and still maintain the schedule — all these are 
potential sources of large errors which affect the scores of all individuals 
in the group. Test makers and publishers can help to prevent timing errors 
by emphasizing the need for precise timing, by outlining for the examiner a 
very definite time schedule from the very beginning of the test to its end, 
including rest periods, by showing the time in boldface print, and by sum- 
marizing the time limits in a time-administration table as well as having 
them interspersed at the proper places in the verbal directions. 

Having time limits long enough that a time-limit test virtually is a work- 
limit test for a majority of the examinees serves to reduce the ill effects of 
bad test administration. This can be done, of course, only in those tests 
where the speed of response is not itself the object of the test. This pro- 
cedure evades the chief fault of the work-limit method; namely, that it does 


not fit well with fixed time schedules of the school. It has its disadvantage 
in discipline and motivation. 

When a battery of tests is administered, directions for later tests can 
often be greatly shortened. It is often of value in such cases to write special 
directions for all tests. Those for the first are usually identical with the 
original ones; those for later tests are shortened as is appropriate to elimi- 
nate duplication. Failure to eliminate duplication makes it possible for 
some examinees to start a later test while the directions are being read, and 
results in boredom for the examinees and waste of time for all concerned. 

A second source of large error in administration is lack of clarity in the 
directions to the examinees. Failure to make the directions clear to the 
examinees may result from an unusually complicated testing situation which 
is not explained in sufficient detail in the manual, blind spots on the part of 
the test maker in preparing the directions, confusion of the examinee due 
to inadequate preparation, slovenly reading of the directions by the ex- 
aminer, use of a vocabulary beyond the level of the examinees, inat- 
tention of individual examinees, and too many directions or directions 
too far removed from their application. Confusion of directions is, in some 
respects, more serious than mistakes in timing. Errors in timing in effect 
either add or subtract a constant amount of time (but not of performance) 
for all pupils in the group, except those who would finish the test anyhow, 
and approximate adjustments in the scores can sometimes be made. Unclear 
directions, on the other hand, may lower the scores of either the entire group 
or of certain pupils in varying amounts. The test manual should contain 
suggestions to examiners concerning procedures in presenting the directions 
and in discussing examples, and specific recommendations concerning the 
extent to which informal supplementary explanation (if any) may properly 
be given to the pupils. 

A third source of error in the administration, as well as in the scoring, of 
objective tests is failure to make clear to pupils what they are expected to 
do about guessing, or failure to indicate how, if at all, the scores are to be 
corrected for guessing. Although the problem created by guessing bulks 
large in test theory and research, it probably is, from a practical standpoint, 
a less important source of error than the first two mentioned. Nevertheless, 
this problem is important enough to warrant considerable attention later in 
this chapter. 
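Where scores are corrected for guessing, the conventional formula deducts a fraction of the wrong answers; this formula is an assumption here, since the chapter discusses the policy question but does not state a formula at this point.

```python
# Conventional correction-for-guessing formula (an assumption here, not
# stated by the chapter at this point): for k-choice items the corrected
# score is S = R - W/(k - 1), with omitted items neither rewarded nor
# penalized.

def corrected_score(rights, wrongs, n_choices):
    return rights - wrongs / (n_choices - 1)

# A pupil with 40 right and 10 wrong on five-choice items:
print(corrected_score(40, 10, 5))  # -> 37.5
```

On five-choice items a deduction of one-fourth of the wrong answers is made, on the reasoning that blind guessing yields one right answer for every four wrong ones.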

A fourth source of error in test administration is variation in the physical 
conditions under which tests are administered. This area has been almost 
entirely neglected by test specialists, but there is reason to believe that 
sometimes it has an important bearing on test results. Such factors are more 


likely to have an adverse effect on time-limit tests than on work-limit ones. 

Schools vary widely in their facilities for the administration of tests to 
groups. In the matter of writing space alone, some schools use large, com- 
fortable tables, others use desks, others armchairs, and still others give their 
tests in the auditorium with each pupil writing on a portable beaverboard 
"desk," or even on his lap. It is not reasonable to expect fully comparable 
results under such varying conditions. The current tendency to use separate 
answer sheets further complicates the problem created by differences in 
physical equipment. The examinee needs more desk space, horizontally, 
when employing separate answer sheets than otherwise. Test makers would 
do well to introduce into their manuals of directions a section on the en- 
vironmental conditions under which the test may properly be given. They 
also have an obligation, incidentally, to see that the data for norms are ob- 
tained under defined environmental conditions and to report to users just 
what those conditions are. 

Other sources of error in the giving of tests are too little or too much 
stress on motivation and failure to control opportunities for chance or pur- 
poseful copying. The problems in these two areas are largely those of the 
local examiner, although the test author can aid the examiner materially by 
means of appropriate suggestions in the test manual. 

Scoring. — Scorers' tests administered to applicants for scoring positions 
by the Educational Records Bureau and other test service organizations 
reveal extremely wide differences in aptitude for this kind of work as 
measured in terms of accuracy and speed. If scoring is to be done well, indi- 
viduals need to be carefully selected, trained, supervised, and checked 
upon systematically. Most significant scoring errors are caused by assignment
of this work to individuals who are not accurate in clerical routine, by in- 
effective training, lack of adequate directions concerning procedure, and 
failure to see that all processes where large errors could occur are invari- 
ably checked by a second person. 

On importance of checking. — Comparatively few people, even teachers,
can add as many as three subtest scores without occasionally making sub-
stantial errors, so this process should always be checked.
Few people alphabetize accurately, so unless this task is done by an
expert, an examinee who has taken a half-dozen tests is likely to have some
scores missing when the results are assembled for a profile. In large test
projects, the alphabetization of the test results becomes increasingly cumber-
some. The process will be facilitated by imprinting on the answer medium
five blocks for the printing-in by the examinee, in capital letters, of five 
letters: the initial letter of his last name, the second and third letters of his 


last name, and the initial letters of his first and middle names. Thus, George 
Robert Johnson would respond |J|O|H|G|R|. Such matters save hundreds
of hours of labor on large test projects. The principle is that a small amount 
of time (free) on the part of a large number of examinees will save a large 
amount of time of a (paid) clerical force. This principle has wide 
applicability. If examinees could reliably score their own or each other's
papers in a few minutes' time, no scoring device in which the papers have to
be fed to the machine individually could compete with it. 
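The five-letter filing code described above reduces to a one-line rule. The sketch below is illustrative; the function name is not from the original.

```python
# Sketch of the five-letter filing code described above: the first three
# letters of the last name, then the initials of the first and middle
# names, all in capital letters.

def filing_code(first, middle, last):
    return (last[:3] + first[:1] + middle[:1]).upper()

print(filing_code("George", "Robert", "Johnson"))  # -> JOHGR
```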

As will be indicated in greater detail later, test makers can reduce the 
number of errors and increase scoring speed by the arrangement of the ma- 
terial on the page, by careful planning of the scoring keys, by preparation 
of separate answer sheets specially designed to facilitate scoring, and some- 
times by overprinting of the key on the answer sheets by a second impres- 
sion after the tests have been administered. 

Need for Meticulous Attention to Details 

There is a curious fallacy in the contemporary attitude toward the ad- 
ministration and scoring of group tests as compared with individual tests. 
It is doubtful if any educational institution would trust the administration 
and scoring of the individual Stanford-Binet Scale to anyone other than a 
trained psychometrician, but there is a rather general impression among 
school people that almost anyone can administer and score a group test if 
only he has a manual at hand. As a matter of fact, however, the administra- 
tion of a group test to a class of pupils in one sense is a far more exacting 
procedure than the giving of an individual test to one pupil. The routine 
is more rigid and the penalty for error is multiplied by the number of in- 
dividuals in the group. In the administration of an individual test a certain 
amount of leeway may be allowed in order to create a situation favorable 
for the eliciting of responses from that particular individual, but when 
testing a group, complete fidelity to all details of the prescribed procedure 
is imperative if the examiner is to avoid wasting the time of many indi- 
viduals and if the results are to have the same meaning for all. 

A high level of success in either the administration or scoring of tests is 
conditioned largely by a willingness on the part of the examiner to attend 
and by a habit of attending religiously to a host of details, any one of which 
may seem of little importance. The test author needs to outline every detail
clearly, completely, and with convincing emphasis upon its importance. The 
examiner must understand that each detail is an integral and indispensable 
part of the standardization process and necessary for adequate comparisons 
of local groups with test norms. 


The Administration of Tests 

As suggested in the preceding section, the accurate and efficient giving of 
a test requires close cooperation between the test author and the examiner. 
It is desirable to review in considerable detail the responsibilities of both 
these functionaries with regard to test administration. 

Administration from the Viewpoint of the Test Maker 

Although it is occasionally necessary for the examiner to make minor 
adjustments to fit the local situation, the conditions of test administration 
should, as far as possible, be under the control of the test author, especially 
if use is to be made of the norms. Comparability in the results from class 
to class, school to school, and test to test can be expected only when this 
principle is observed. It is, therefore, one of the major responsibilities of 
the test maker to decide and specify clearly all conditions of administration. 
The preparation of directions for giving the test is universally accepted as 
a phase of test construction, but too often it is regarded as a minor aspect 
to be handled hastily just as the test is about to go to press. The formulation 
of the instructions, including the application of research in reaching deci- 
sions, should be regarded as a professional task equally important with item 
writing and validation. 

Among the main decisions which a test author needs to make before he 
writes the directions for administration are those concerned with time 
limits, motivation of the subjects, and control of guessing. 

Setting of time limits 

In measurement dealing with "higher thought processes," counselors, 
school administrators, and the subjects themselves usually are primarily 
interested in determining level rather than speed of work. Speed is ad- 
mittedly of some individual importance, particularly when the rate of work 
is exceptionally rapid or exceedingly slow in connection with certain skills 
constantly in use, such as reading, alphabetization, and the like. It is also
decidedly important in some mechanical jobs. Since most life situations, 
however, ordinarily do not call for completion of a certain task within rigid 
time limits, success generally is conditioned by level more than by speed. 
Two individuals with equal level of achievement may be deemed equally 
successful on a job even though one is twice as fast as the other, if the 
second one is able and willing to spend twice as much time on the task. 
Theoretically, most educational tests might well be untimed if efficient 
use of testing time were not important and if it were not necessary for test- 
ing programs to conform to school time-scheduling. 


The school day is usually divided into class periods of forty to sixty 
minutes' duration. It is no doubt desirable for schools to plan special sched-
ules to handle their testing programs, but many schools do not find it 
feasible to do this. Consequently, there is widespread demand on the part 
of schools for tests that will conform to the customary class periods. Most 
test authors accede to this demand and set their tests for approximately forty 
minutes, or multiples thereof. The result is that many achievement tests are 
entirely too short to permit adequate reliability of performance, to say noth- 
ing of validity of response. 

In order to take account of class periods, a test maker may either use more 
material than most pupils can cover in the period and set up a rigid time 
schedule, or he may make his test primarily one of level by using so small 
an amount of material that all, or practically all, pupils can complete it 
within the designated period. The second procedure is used sparingly, for it 
tends to lower the statistical reliability of the test, and, moreover, it fre- 
quently is frowned upon by school people since they feel that it complicates 
the discipline problem during test administration and encourages idleness 
and time wasting on the part of the faster pupils. The upshot is that most 
tests designed for school use are timed tests. 

On objections to time-limit tests. — If the disciplinary problem created
by having pupils finish at different times can be surmounted, many of the 
difficulties of time limits are avoided by employing work-limit tests. The 
latter afford generally better motivation. Furthermore, since all items are 
attempted by everyone, the test responses may be analyzed to pick out those 
items most worthy of perpetuation in subsequent editions of the tests — as- 
suming that a criterion is, or can be, made available. If this is done, the
individual items of a test should improve in statistical merit in subsequent 
editions. This is difficult to achieve in time-limit tests. Time-limit tests, 
furthermore, are prone to yield spuriously high reliability coefficients when 
they are computed by split-halves or internal consistency methods. 

Relationship of test length to objectives of measurement. — Theoretically, 
the length of a test and the time limit for it should be determined in the 
light of the purpose for which the test is given. It is well known that tests 
used to evaluate groups may be very short, for reliable differences among 
groups may be obtained on the basis of a small number of questions. If the 
purpose is over-all appraisal of individuals, in terms of total score, the test 
must be longer, but for few aspects of school aptitude or achievement does 
it need to be longer than one class period in order to be satisfactorily 
reliable. If individual diagnosis is desired, a total of three or four class 
periods may be called for in order to use enough test items to bring about 


sufficient reliability in the several diagnostic scores. Considering the potential 
importance of the results to the individual examinee, he could profitably devote, 
if need be, many times the number of hours he now devotes to test taking.
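The dependence of reliability on test length invoked in this paragraph is conventionally projected with the Spearman-Brown prophecy formula; a minimal sketch follows (the reliability figure of .75 is illustrative only).

```python
def spearman_brown(r_orig, length_factor):
    """Predicted reliability of a test lengthened (or shortened) by the
    given factor with comparable material: r' = k*r / (1 + (k-1)*r)."""
    return length_factor * r_orig / (1 + (length_factor - 1) * r_orig)

# A one-period test with reliability .75, tripled to three class periods
# to support diagnostic part scores:
print(spearman_brown(0.75, 3))  # 0.9
```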

As a matter of fact, it is seldom feasible, aside from specific experimental 
situations, to separate objectives in this way. Schools tend to use a single 
test for a variety of objectives. They wish to make interschool comparisons, 
to evaluate class groups, to provide data for the counseling of individuals, 
and to diagnose pupil strengths and weaknesses, all on the basis of tests 
requiring exactly one class period. The test maker usually compromises by 
aiming first of all at obtaining valid and reliable total scores of individuals 
while at the same time providing an opportunity for limited diagnosis 
through setting up part scores on three or four of the more important areas 
covered by the test. 

Need for experimental determination of optimum time limits. — There is 
a definite need for test makers to determine on the basis of research the 
amount of time to be allowed for different types of material. Theoretically, 
the problem of setting time limits for educational achievement tests involves 
finding the answer to either of two questions: (1) "For a given amount 
of test material, what amount of time per item will result in the maximum 
validity per unit of time?" or, (2) if the test is to fit a certain predetermined 
time interval, "What number of items will yield the maximum validity in
the designated time?" 

One section of a study by Cook (11) illustrates the application of this
type of experimentation to the measurement of spelling ability. Cook defined 
the optimum time for the administration of the test as "that time at which
further increase in the validity of the obtained scores can better be secured 
through the addition of more material with a proportionate increase in time 
than by permitting more time on the same material." He then applied this 
definition to various arrangements of a 50-word spelling test at the eighth-
grade level, using as a criterion a 150-word list dictation spelling test. His 
results indicated optimum rates in the testing of eighth-grade pupils, such as 
ten words a minute for a right-wrong spelling test, five words a minute for 
a four-response spelling test, and about four words a minute for a word-in-a-
sentence recall spelling test.

This technique assumes that an external criterion of validity is available. 
The criterion may be concerned exclusively with power or level to the com- 
plete exclusion of speed, but even in this situation the "best" time limit for 
the experimental test is not necessarily an indefinite or even a liberal time 
limit. For example, the total score on a 100-item experimental test might 
correlate .80 with a given criterion, if the administration time were liberal 


enough to allow all pupils to finish, say two hours, and might show a 
lower correlation with the criterion if administered in any shorter time 
limit, in which only part of the pupils finish. This finding would not mean 
that the best results will be obtained by administering this 100-item test in 
two hours. It is possible that the great majority of the examinees are able to 
complete this test in one hour, and that the time of most of them is wasted, 
with no improvement in the validity of their scores, in order to obtain a 
slight increase in validity in the scores of just two or three slow workers. It 
might be better, therefore, to improve the validity of the majority of the 
scores by increasing the length of the test (but not the time), even at some 
sacrifice in the validity of the scores of the few slow examinees. That is, one 
might secure a more valid score for the typical examinee by administering 
more than 100 items in the two hours. Experimentation might show, for ex- 
ample, that 164 items administered in two hours might yield a total score 
that would correlate .82 with the criterion. More items would yield a lower 
correlation because too much emphasis is then placed on speed; fewer items 
would yield a lower correlation because of the more limited sampling of 
content. It is true that the 164 items might yield a still more valid score in 
a longer time period, but they would, nevertheless, represent the most 
efficient way to use the two hours.
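The reasoning of this paragraph can be played out in a toy simulation (Python with NumPy; every number in it is invented, and whether validity peaks at an intermediate length depends entirely on the assumed parameters). Examinees differ in level and in speed, the criterion reflects level only, and lengthening the test within a fixed time broadens the content sampling while letting speed intrude on the scores.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 2000
level = rng.normal(size=N)                    # what the criterion measures
speed = rng.normal(size=N)                    # rate of work, independent of level
criterion = level + 0.3 * rng.normal(size=N)

def timed_validity(n_items, minutes=120):
    """Correlation of a timed total score with the level criterion."""
    rate = np.exp(0.25 * speed)               # items per minute, person-specific
    attempted = np.minimum(n_items, (rate * minutes).astype(int))
    p_right = 1.0 / (1.0 + np.exp(-level))    # higher level, higher accuracy
    score = rng.binomial(attempted, p_right)  # rights among attempted items
    return float(np.corrcoef(score, criterion)[0, 1])

for n in (100, 164, 300):
    print(n, round(timed_validity(n), 3))
```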

If speed is not a factor in the criterion at all, the total score on a given 
amount of test material will yield the maximum total validity if administered 
under work-limit conditions, but not the maximum validity per unit of time. 
Thus, even though speed is not a factor in the criterion at all, one might 
make most efficient use of whatever amount of time is taken, and improve
the validity of the majority of scores, if a time limit is employed which will 
not permit all pupils to finish. This principle should be recognized in 
achievement test construction in general, even though external criteria are 
not available to make possible the experimental determination of optimum 
time limits. 

In actual practice, of course, external criteria of validity are rarely avail- 
able in the construction of educational achievement tests. It is usually neces- 
sary to compromise between what is sometimes regarded as the "ideal"
work-limit conditions and various practical considerations. It is known that 
pupils differ greatly in speed and that if a test is administered as a work- 
limit test in which even the very slowest workers are allowed to finish, 
either a considerable amount of time will be wasted for most of the pupils 
and a disciplinary problem will be created in attempting to keep all pupils 
in their seats until all have finished, or irrelevant distractions will be intro- 
duced for pupils still working if the faster pupils are permitted to leave 


whenever they finish the test. Accordingly, test authors usually set the time 
limits so that between 80 and 90 percent of the pupils can consider or at- 
tempt all of the items and use various devices (such as "cushion" items of 
high difficulty at the end of the test and interspersed directions throughout 
the test period) to keep the fast pupils occupied and to keep the slow pupils
moving along more rapidly. Actually, higher validity per unit of time is 
probably secured on this basis than would be obtained if the same test were 
administered under work-limit conditions. 

There is a fertile and almost untouched field for research on the optimum 
rates of administration of tests in many subject fields and at different grade 
levels.

Relation of timing directions to the degree of independence in the vari- 
ables sampled by the parts of a test. — Two general procedures are used in 
timing the different parts of a test. One procedure is to time each part 
entirely independently through having all pupils start a part at exactly the 
same time and then wait for the signal, if they finish earlier, to go on to the 
next part. The other is to instruct the students in the beginning that if they 
finish a part before time is called they are to go on to the next part, and 
then to say at the expiration of the time for a given part, "If you have not 
finished Part I, stop working on that part and go on to Part II," and so on.

The theory back of that kind of timing is that, other things being equal, 
the more items attempted, the higher the reliability and, indirectly, the 
higher the validity as well. A pupil who is rapid on one type of material 
may be slow on another type. If the test booklet, as a whole, contains 
enough material to keep most pupils working continuously throughout the 
entire period, it is reasoned that the reliability and validity of the total 
scores will be higher if the pupils do not spend time waiting for signals to 
proceed when they might be using that time in the solution of questions. 
Moreover, as indicated earlier, level rather than speed tends to be stressed 
when the subjects are allowed to use all the time in the testing period to 
full advantage. With the time limit about right, the directions serve only to 
keep the very slowest from taking an interminable amount of time on indi- 
vidual items. 

Although the logic of that point is obvious as far as the total score is 
concerned, it appears that decisions concerning exclusive or overlapping 
timing of the parts should be made on the basis of whether or not separate 
part scores are to be obtained and of whether the functions measured by the 
parts are highly correlated or relatively independent. The timing of the Co- 
operative Reading Comprehension Test is a case in point. This test has two 
parts — vocabulary and reading comprehension. It yields scores for vocabu- 


lary, speed of comprehension, and level of comprehension. Fifteen minutes 
are allowed for the vocabulary part and twenty-five minutes for the reading 
comprehension part. The examinees are instructed to go on to the reading 
comprehension part as soon as they have finished vocabulary even though 
the time for that part has not expired. Thus, the speed-of-comprehension 
score (but not the level score) represents a composite of rate of responding 
to vocabulary items and rate of reading paragraphs and answering ques- 
tions. This way of timing is defensible only if it can be shown that these 
functions are so highly correlated that the speed-of-comprehension Scaled 
Scores of the pupils under this overlapping time arrangement are virtually 
equivalent to, or better than, what they would be if the time limits were 
independent. Research on such matters is virtually nonexistent. 

Correlations among speed, level, and power tests. — The question of the
advisability of using time limits with tests designed to measure scholastic 
aptitude and achievement should be considered in the light of the correla- 
tions between timed and untimed tests. It is known that when speed and 
level are measured independently the correlation between the resulting scores 
is comparatively low. Blommers and Lindquist (6) reported that the with- 
in-grade, within-school correlation of rate of reading comprehension with 
reading comprehension from which the influence of speed was eliminated 
was approximately .30. Tinker (56) found that speed and level scores on
the Revised Minnesota Paper Form Board Test varied independently and 
that about 75 percent of the variance of the time-limit score, which he 
called the "power" score, was accounted for by speed and level. Likewise, 
Davidson and Carroll (17), in a factor analysis of a number of group mental 
tests administered to college students, found that speed was linearly inde-
pendent of level and that time-limit scores could be represented as fac- 
torially complex measures having heavy loadings of both speed and level. 
They concluded that because of their factorial complexity, time-limit scores 
should be used with considerable caution in the prediction of criteria. 
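Findings like Tinker's (about 75 percent of the variance of time-limit scores accounted for by speed and level) amount to the squared multiple correlation obtained by regressing the time-limit score on separately measured speed and level scores. A sketch with fabricated loadings:

```python
import numpy as np

rng = np.random.default_rng(7)
N = 1000
speed = rng.normal(size=N)
level = rng.normal(size=N)                      # taken as independent of speed
timed = 0.6 * speed + 0.6 * level + 0.5 * rng.normal(size=N)

# Proportion of time-limit variance accounted for by speed and level (R^2)
X = np.column_stack([np.ones(N), speed, level])
beta, *_ = np.linalg.lstsq(X, timed, rcond=None)
r_squared = 1.0 - (timed - X @ beta).var() / timed.var()
print(round(r_squared, 2))  # about .74 under these invented loadings
```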

Other studies, however, have shown that when timed and untimed scores 
(liberal time limits) are obtained from the same test at the same sitting the 
correlations are comparatively high. For example, the correlation between 
speed of comprehension and level of comprehension on the Cooperative 
Literary Comprehension Test was found in one study (64) to be .917.
While theoretically it would usually be preferable to administer most 
achievement tests without time limits if ample time were available for 
testing, the correlation between untimed, or "level," scores, and liberally 
timed, or "speed," scores usually is high enough to warrant the administra- 
tion of tests under liberally timed conditions. 

Eliminating the influence of time on certain tests. — The most common 


way of reducing or eliminating the influence of time on tests is to set the 
time limits so liberally that all, or nearly all, pupils are able to consider or 
attempt all the items in the test. It is apparently the intention of many 
achievement test constructors at present to allow from 80 to 90 percent of 
the pupils in a typical group to finish a section before instructing the group 
to turn to the next section. 

It is possible, however, to reduce the effect of the time factor in some 
tests through ingenuity in test construction without either cutting the 
amount of material in the test or allowing more time. A good illustration 
of one procedure for obtaining a level score in a timed test is furnished 
by the Cooperative Reading Comprehension Test. The paragraph reading 
part of this test contains 90 items arranged in three cycles of 30 items each. 
Since the three cycles are of comparable difficulty, each cycle is a shortened 
form of the entire test. The level of comprehension score is based on the 
average number of correct responses in the cycle completed by each subject, 
as contrasted with the speed-of-comprehension score, which is found from
the number of correct responses in the time limit. Even the slowest indi- 
viduals nearly always finish the first cycle and thus obtain level scores which 
are not dependent upon the time allowed for the test. This technique could 
appropriately be applied to other kinds of tests. It is no doubt better adapted 
to the measurement of scholastic aptitude, reading ability, and skills than 
to achievement testing in the content fields. 
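The cycle arrangement lends itself to a simple scoring sketch. The function below is my own simplified rendering of the scheme as described, not the Cooperative test's actual scoring rules.

```python
def speed_and_level_scores(responses, cycle_len=30):
    """responses: 0/1 scores for the items answered, in order, within
    the time limit (unreached items absent). Speed score: rights within
    the time limit. Level score: average rights per fully attempted
    cycle, so it does not depend on how far the examinee got."""
    speed_score = sum(responses)
    n_complete = len(responses) // cycle_len     # cycles fully attempted
    if n_complete == 0:
        return speed_score, None                 # no level score obtainable
    completed = responses[:n_complete * cycle_len]
    level_score = sum(completed) / n_complete    # average rights per cycle
    return speed_score, level_score

# A slow but accurate reader: finishes one 30-item cycle with 27 right,
# then answers 5 more items before time is called.
resp = [1] * 27 + [0] * 3 + [1, 1, 0, 1, 1]
print(speed_and_level_scores(resp))  # (31, 27.0)
```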

Devices to insure careful observation of time limits. — Several suggestions 
concerning the formulation of directions so as to facilitate observation of 
time limits were made in an earlier section of this chapter. A further device 
to encourage examiners to give careful attention to the timing of tests is a 
room examiner's report sheet. This sheet summarizes the allotted time for 
the different tests and provides blank spaces where the examiner may record 
the starting time and the finishing time actually employed. The sheet should 
be filled out and signed by the room examiner and then turned in to the 
head examiner of the local school system, who, in turn, should forward it 
to the central agency if the school is participating in a state-wide or nation- 
wide testing program. To illustrate this procedure a report sheet used in one
of the testing programs carried on by the member schools of the Educational 
Records Bureau is shown in Figure 6 and a schedule employed in the Na- 
tional Teacher Examinations is exhibited in Figure 7. 

Considerable impetus can be given to accurate timing through the pro- 
vision of better timepieces. A survey of timing procedures in the administra- 
tion of tests by member schools of the Educational Records Bureau indi- 
cated that nearly half of the examiners used ordinary watches and that from 
20 to 40 percent of them did not check their watches for accuracy before 
timing the tests. 


[Figure 6, a facsimile form, is not reproduced here. It provided blanks for the name of the test, the numbers of pupils and proctors, the type of desk, whether pupils were seated in alternate seats, whether an ordinary or a stop watch was used, and whether the manual of directions was studied carefully before starting the test; a table of the minutes required for the various parts of each test, with columns in which the examiner recorded the starting and finishing times; and spaces for describing irregularities (including names of pupils leaving the room and time out) and for general comments.]

Fig. 6. — Example of test report sheet to be signed by room examiner. Used by member 
schools of the Educational Records Bureau.






[Figure 7, a facsimile schedule and time record, is not reproduced here. For the three examining sessions (February 9 and February 16) it listed the tests and their parts and sections (Verbal Comprehension, Non-verbal Reasoning, Professional Information, English Expression, General Culture, and the option tests), the minutes to allow for each, and columns in which the room examiner recorded the actual times. The schedule and time record was to be signed by the Room Examiner and submitted to the Local Examiner for forwarding at the end of the last examining session.]

Fig. 7. — Example of schedule and time record to be signed by room examiner ad- 
ministering National Teacher Examinations.


Because of the expense involved, few schools would find it feasible to 
provide each staff member serving as an examiner in a group testing pro- 
gram with a stop watch, nor is a watch graduated into fifths of a second 
needed in timing group tests. Some stop watches require rewinding after 
about a half hour's use, and the time in the last ten minutes is likely to be
somewhat in error. A very satisfactory substitute for, or in some respects a 
distinct improvement over, a stop watch is an interval timer, which is simply 
a miniature alarm clock which rings a bell at the end of any set interval of 
time. Another good illustration is a sports timer, which is simply an ordi- 
nary, inexpensive watch with a start-and-stop device. Using this device, the 
examiner can stop the second hand at zero, set the minute and hour hands 
at 12:00, start the watch when he says "begin," and compute all time for 
the test from twelve o'clock. If it is necessary to take time out for directions 
between parts, the examiner needs merely to stop the watch while he reads 
the directions. 
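The sports-timer routine just described (start from a single zero point, stop the watch while directions are read, and compute all times from the zero point) translates directly into a small software stopwatch; a sketch in Python:

```python
import time

class PausableTimer:
    """Software analogue of the start-and-stop sports timer described
    above: elapsed time is reckoned from one zero point, and the clock
    can be paused while directions are read between parts."""
    def __init__(self):
        self._elapsed = 0.0
        self._started_at = None   # None while paused

    def start(self):
        if self._started_at is None:
            self._started_at = time.monotonic()

    def pause(self):
        if self._started_at is not None:
            self._elapsed += time.monotonic() - self._started_at
            self._started_at = None

    def elapsed(self):
        running = 0.0
        if self._started_at is not None:
            running = time.monotonic() - self._started_at
        return self._elapsed + running

t = PausableTimer()
t.start()                 # "begin"
t.pause()                 # time out while reading directions
t.start()                 # resume for the next part
print(t.elapsed() < 1.0)  # True: essentially no time accumulated here
```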

Experience has shown that this type of watch is especially helpful in 
reducing the examiner's tension and enabling him to do an accurate job. 
Sports timers and interval timers are easily within the budgets of nearly all 
schools. They do not get out of adjustment so easily as a stop watch, and 
with care they will last indefinitely. If a school's measurement department 
were equipped with a sufficient supply of these timers, they could be used
for all group testing, and their accuracy could be checked under the super- 
vision of the head examiner before they were handed out to the room ex- 
aminers. It is believed that the use of this simple and inexpensive equip- 
ment would significantly increase the accuracy of measurement in school- 
wide testing programs. 

If employing an accurate, ordinary watch of standard make, one may 
say "go" as the second hand reaches 60. One minute later he should test 
the watch to see that it now reads exactly 12:01 as the second hand reaches 
60. Old watches, even of standard make, often have so much lost motion in 
the gears that one may find it impossible after a time to be sure whether 
the time is, say, three minutes or four minutes. Such watches should never 
be used. 

Even with good timers, serious difficulty is sometimes encountered be-
cause the timepiece fails during a testing period. To be safe, one ought to 
be able to refer in an emergency to a wall clock or a pocket watch in addition 
to a special timer. 


Motivation of the subjects

One of the most intangible of the variables affecting test performance is
motivation. Since his directions help to condition motivation, the test maker 
should give careful attention to this problem. 


Influence of motivation on test scores. — It is reasonable to believe that 
test scores are sometimes significantly influenced by motivation. 

Glick, in 1925, showed that by taking five forms of Army Alpha, inter- 
spersed with fifteen practice forms, the average college sophomore could 
double his Army Alpha score (27). Courtis, in 1932, by using a wider 
variety of incentives showed an even greater improvement (12). Such in-
centives, however, would never normally prevail in a testing situation. 

Kirlin (36) carried on a controlled experiment in which Forms A and B
of the Advanced Compass Survey Tests in Arithmetic were administered to 
131 pupils in grades eight and nine. The two forms were first administered 
without special motivation, with half the pupils taking one form and the 
other half the other form. Three days later the forms were reversed and 
were given with every possible effort to motivate the pupils. In the pre- 
liminary instructions for the second testing, the pupils were told that their 
scores would help to determine their six weeks' grades, and cash prizes 
were offered to the pupils making the largest increases in scores over the 
first testing. Observation indicated that the pupils were keenly interested and 
greatly motivated. The results showed a statistically significant increase in 
mean scores, but on the average not enough to change a pupil's percentile 
rank in the entire group by more than a few points. It is probable, also, that 
practice effect accounted for part of this difference. 

In view of the probable influence of practice and the fact that means of 
motivation were employed that were beyond those which would ordinarily 
be used in a school situation, the gains found by those who have studied the 
question of motivation may be discounted to some extent. The results indi- 
cate that motivation does affect test performance, but perhaps not to as 
great an extent as commonly supposed. This is another area for research 
that needs further exploration. 

Non-test-motivated individuals are commonly believed to exist, and cases 
of such are occasionally brought to light. In the Army, the test-malingerer is always
a potential or actual problem. Goldstein devised a scoring procedure for 
detecting malingerers among those taking tests in the Army and showed 
experimentally that this scoring key distinguished between malingerers and 
genuine failures in a large proportion of the cases (28). The theory of a 
potential index of test-taking motivation is simple: 

1. Individuals are not able to gauge very accurately the difficulty of an item.

2. Even if they were able, they would not know, without an extensive 
statistical analysis (always wanting), how to regulate their score on easy 
items and on hard items of the same test. 

3. Consequently, if the score on certain easy items is not statistically 


comparable with the score on certain hard items of the same test, the ex- 
aminee may be suspected of wrong test motivation. 

4. Only with optimal motivation will the individual make comparable 
scores on two tests of unequal difficulty but otherwise comparable. 

Even copying is potentially detectable by a greater generalization of the 
above notion. To remove such non-motivated or non-evenly-motivated indi- 
viduals from our test populations would improve our item selection, our 
norms, our validity coefficients, and our statistical comparisons generally. 
It may not be too much to prophesy that shortly we shall not think of giving 
a test without deriving a score of test-taking motivation to tell us whether 
we shall take the test result at full value or only after a retest with more 
nearly optimal motivation prevailing. This is a problem for test authors and 
publishers and is one that they ought to consider seriously in future test 
construction. It is probably fair to say that, in general, modern test makers 
do a competent technical job but often do not take advantage of the oppor- 
tunity to introduce innovations that would greatly assist the test user. 
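The four numbered points above suggest a screening index of the kind Goldstein used. The sketch below is a hypothetical illustration of the logic only, not his actual scoring key, and the threshold is invented: a genuinely low-ability examinee still does relatively better on easy items, while a malingerer, unable to gauge item difficulty, tends to miss easy and hard items at similar rates.

```python
def motivation_flag(easy_pct, hard_pct, tolerance=0.15):
    """Flag an examinee whose percent correct on easy items is not
    comparably above his percent correct on hard items of the same
    test. Threshold and inputs are illustrative only."""
    return easy_pct < hard_pct + tolerance

print(motivation_flag(0.40, 0.35))  # True: suspiciously flat profile
print(motivation_flag(0.85, 0.35))  # False: normal easy-hard gap
```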

How much motivation is desirable? — The problem of motivation is un-
doubtedly complicated by the fact that individuals react to a motivating 
stimulus in different ways and that even before the stimulus is applied they 
are motivated in varying degrees by their individual reactions or "set" to 
the very idea of taking a test. Some pupils approach a test with complete 
lack of interest and considerable boredom and will not put forth their best 
efforts unless some outside motivating influence is present. Others become 
nervous and overexcited in a test situation and thus may find their per- 
formance blocked by their emotional state. The most desirable motivating 
conditions are those which enable the largest number of individuals to turn 
in the best performance without undue emotional stress. 

Ways of standardizing or controlling motivation. — The personality and 
attitude of the examiner, of course, have a great deal to do with the 
motivation of the subjects. With a businesslike but not too severe manner, 
an alert and understanding examiner can arouse interest and, at the same 
time, relieve tension. He can, however, do a better job if he has the aid of 
brief printed directions in which the problem of motivation is recognized. 

One kind of printed material useful in getting all individuals oriented 
to the tests and properly motivated for them is a general statement con- 
cerning the nature and purpose of the testing program which may be 
distributed to the pupils several days before the tests are given. A test- 
making or test-service agency probably should be prepared to furnish 
the schools it services with such a statement. The statement of "Test In- 
formation for Students" which the Educational Records Bureau distributes 
to its member schools is as follows:



The purpose of this sheet is to inform you about the general nature of 
the Educational Records Bureau tests which you will take within the next few 
days. Two kinds of tests will be taken by practically all pupils. One of the 
tests is designed to determine your reading ability; the other is intended to 
measure your aptitude for school work. Your school may be one of those 
that will give still other tests to discover your knowledge of the subjects you 
have studied in school. None of the tests will be used to determine your 
grade or mark. The purpose in giving them is to inform your teachers about 
your ability and your needs so that they can provide the best possible learn- 
ing conditions for you. 

Each test contains a large number of questions calling for very brief 
answers, such as the writing of a word or two, the underlining of a word, 
or the selection of the correct answer from several suggested ones. Do not 
become discouraged if you find a large number of questions which you 
are unable to answer. You are not expected to answer all the questions. Some 
tests are used throughout several grades and, of course, the pupils in the 
lower grades are not expected to answer as many questions as those in the 
higher ones. It is practically impossible for even the most advanced students 
to obtain a perfect score. 

It is advisable to answer some questions about which you are not entirely 
sure. If you think you know the answer, you should put it down even though 
you are not certain, but you should not guess wildly on questions concerning 
which you are totally ignorant. In some tests, in which a certain proportion 
of the wrong answers is subtracted from the correct answers, blind guessing 
may result in a large reduction in one's score. 

There is no passing mark for these tests. The results will be expressed 
in percentiles which will show how you stand in comparison with other
students who take the same tests in other schools. 

Since time is such an important element, be sure to have at least two 
well-sharpened pencils and an eraser. It is not advisable to use ink because 
of the possibility of the pen running dry and the difficulty of erasing. 

Some tests are to be given with special answer sheets in order that they 
may be scored by means of an electrical machine. If you take one of these 
tests, you will be given a practice test to acquaint you with the way in which 
the answer sheets should be marked. The important thing to remember is 
to use only the special pencil which will be furnished you and to make heavy 
black marks. 


1. Listen carefully to all instructions given by the examiner. 

2. You are not expected to answer all of the questions, but answer as many 
as you can. 

3. Work as rapidly as you can, spending no time puzzling over difficult 
questions. Return to the hard questions if you have time after you have 
gone through the test. 

4. Guess only if you can do so intelligently. Don't guess if you know 
nothing about the question. 


5. Go prepared with two pencils and eraser. If a special pencil is given you, 
use it only. 

6. Do not waste your spare time during the days on which the examinations 
are scheduled, but spend the time constructively as you would during 
any examination period. 

The other kind of printed material useful in helping examiners deal 
with the problem of motivation is the printed instructions which the test 
maker includes in his manual for the examiner to read at the opening of 
the test period. These instructions are well illustrated by two paragraphs 
from the Manual of Directions for Iowa Tests of Educational Development: 

This morning you are going to begin taking a special series of tests. The 
purpose of these tests is to tell us something about your general educational 
development. We want to know what background of ideas you have in the 
social studies and in the natural sciences, how well you can read and interpret 
different kinds of materials, how readily you can solve arithmetical prob- 
lems, etc. This information will help us to discover how well this high 
school as a whole is accomplishing some of the things it is supposed to be 
doing, and will show how it compares with other high schools in these 
respects. More important, what we learn about you from these tests will 
help your teachers to decide how to fit their teaching to your individual 
needs, and will also help us in advising you on your future educational plans. 
The test results can also be of real value to you in working out your own 
educational and vocational plans, and in making up your mind how best to 
distribute your effort in your school work. 

The test results can be very valuable in these ways, but only if each of 
you does his very best on all of the tests. If you do not make a sincere effort, 
the scores will tell a false and misleading story about your abilities, and 
your time spent in taking the tests will be worse than wasted. If you look 
upon these tests as a challenge — as an opportunity for you to show what 
you are really capable of doing — then you are bound to enjoy taking them. 
It is only if you do not try that the tests can become tiresome or uninteresting. 

Guessing on objective tests 

The question of what to do about chance successes and guessing has 
plagued test makers from the very beginning of objective testing. After 
more than three decades of experience with multiple-response tests, there 
is still not complete agreement among test specialists concerning this problem. 

Effect of guessing when score is number right. — All types of objective 
questions in which the subject is given a choice between two or more sug- 
gested responses, of course, involve elements of chance and guessing. The 
contributions of chance and guessing to the results of true-false and other 
two-response questions are, of course, potentially greater than the con- 
tributions of these same factors to the results of questions involving a 
larger number of choices. Consequently, test makers have been more con- 
cerned with the effect of guessing on two-response tests than on tests 
involving a larger number of choices, but it is recognized that the problem 
continues to be present as the number of choices is increased, although in 
diminishing degrees. 

The general effect of guessing, where the score is taken simply as the 
number right, is, of course, to raise the scores of the guessers. It is 
apparent that a pupil who knows nothing at all about a subject should 
be able by chance to answer correctly half of the questions in a true-false 
test on the subject. The more he knows about the subject, the less he will 
have to guess, but the theory is that if he guesses at all, he will, on the 
average, get half his guesses right and the other half wrong. By similar 
reasoning, on a three-response test, an individual has one chance of guessing 
right and two chances of guessing wrong; on a four-response test, one 
chance of being right and three chances of being wrong; on a five-response 
test, one chance of hitting upon the right answer by guessing and four 
chances of choosing the wrong answer, and so forth. The number right 
should, therefore, be discounted by W/(n - 1). Hence, the corrected 
score, assuming that all incorrect responses are the result of guessing, is 
S = R - W/(n - 1), where R is the number right, W the number wrong, 
and n the number of choices in each item. 
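The formula above can be sketched in a few lines of code. The scores used here are hypothetical illustrations, not figures from the text:

```python
# The standard correction-for-guessing formula described above:
# S = R - W/(n - 1), where R is the number right, W the number wrong,
# and n the number of choices per item. Omitted items are not counted as wrong.

def corrected_score(rights, wrongs, n_choices):
    """Return the score corrected for blind guessing."""
    return rights - wrongs / (n_choices - 1)

# A pupil who guesses blindly on a true-false test (n = 2) is expected,
# on the average, to get half his guesses right and half wrong, so the
# expected corrected score from pure guessing is zero:
print(corrected_score(50, 50, 2))   # 0.0

# On a five-choice test, 20 right and 12 wrong:
print(corrected_score(20, 12, 5))   # 17.0
```

Note that the formula treats every wrong answer as a blind guess, which, as the following paragraphs point out, is rarely true in practice.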

If all pupils did an equal amount of guessing and if chance always 
operated according to theory, guessing would be of no importance in 
multiple-response tests unless one were interested in turning the scores 
into percentage grades. However, examinees vary greatly in their willing- 
ness to take a chance by guessing at questions about which they know 
nothing, and some will need to guess on fewer items than others. Thus, 
where the score is the number right, individuals will not have exactly 
the same relative standing on a test that they would have had if guessing 
had not been a variable in the scores. 

There is a worth-while distinction to be made between "honest" and 
"dishonest" guessing. During the war many of the tests used in connection 
with the armed forces were scored on the basis of right answers. A common 
practice of men taking the tests who found they could not finish was to 
run rapidly through the remaining items and to mark some response 
without even reading the item. This is distinctly different from "honest" 
guessing based on hunches or half-knowledge. The correction for guessing 
is appropriate for the purely chance "dishonest" guessing, but no correction 
can be altogether appropriate for guessing based on some knowledge of 
the item. In everyday life, persons are continually guessing on the basis 
of partial knowledge, but they hope to guess right a fair share of the 
time. If one who is taking a test can make intelligent guesses in excess 
of the correction for guessing, he is receiving only his just dues. 

This point of view places the correction for guessing in a different 
light. One of the main purposes of the correction is to discourage dishonest 
guessing and to extract a suitable penalty for it. 

Another very important consideration is that the proportion of examinees 
selecting any incorrect response depends upon its plausibility, and that the 
skillful item writer attempts to make each incorrect response as plausible 
as possible to the examinee who does not possess the desired knowledge 
or ability. The preparation of wrong responses, and hence of right 
responses also, to a multiple-choice item is thus dependent upon the 
skill of the item writer. In effect, the item writer attempts to make each 
wrong response so plausible that every examinee who does not possess 
the desired skill or ability will select a wrong response. In other words, 
the item writer's aim is to make all or nearly all considered guesses wrong 
guesses. If the item writer succeeded in this aim, and if all guesses were 
considered guesses, there would be no need for corrections for guessing. 
In such a situation, the standard scoring formula would seriously over- 
correct for guessing. In actual practice, this aim of the item writer is 
never fully realized, but it is doubtless often sufficiently realized that the 
standard formula markedly overcorrects. 

Still another aspect of this problem deserves serious consideration. 
As already noted, the correction for guessing formula assumes that all 
incorrect responses are the result of blind guessing, and that by subtracting 
W/(n - 1) from the number of rights one can get an estimate of the 
number of questions to which the examinee "really knows" the right 
answer. This logic seems fairly plausible as applied to a right-answer type 
of test (see page 196), but it is obviously inapplicable to a best-answer 
test. Items of the best-answer type require the student to distinguish be- 
tween a number of responses all of which may be "right" to some degree, 
or none of which may be unqualifiedly correct. There is no "right" answer 
that can be "really known" by the examinee — none of the answers may ever 
have occurred to him before taking the test. The task for the examinee is 
one of comparison, and evaluation of all responses, not of recognition of 
a single right answer. For this kind of item, one can hardly discover how 
many "really knew" the right answers by subtracting a fraction of the 
choices of distracters. 

Instructions to pupils concerning guessing. — There are two general pro- 
cedures for dealing with the problem of guessing on objective tests. One 
is either to encourage guessing or to leave decisions concerning whether 
or not to guess to the individual idiosyncrasies of the examinees and then 


to use a formula to attempt to correct for guessing as a part of the scoring 
process. The other is to try to control guessing either by instructing the 
examinees to guess or by telling them not to guess. 

The theory of correction formulas and procedures of using such formulas 
will be discussed in the section on scoring. We are here concerned with 
the instructions which the test maker should prepare for the examiner 
to read to the pupils about guessing. An attempt may be made to eliminate 
guessing by instructing the examinees not to guess. If all individuals 
could follow uniformly an instruction of this kind, perhaps most of the 
problem of guessing would disappear. However, they cannot follow such 
instructions uniformly, since the meaning of the term "guessing" differs 
markedly from individual to individual. Instructions of this kind may 
serve only to magnify the effect of guessing upon the relative standing of 
the individuals in a group, for some individuals will observe the directions 
religiously, while others will disregard them. 

From the standpoint of psychological and statistical theory, a strong 
case may be made out for instructing examinees to attempt every item in 
objective tests. Even if the scores are not corrected for guessing, it is 
felt that instructing pupils to try all items will reduce the differences in 
the scores due to guessing, since it will encourage the less venturesome 
individuals to do as much guessing as those who naturally have a greater 
tendency to take a chance. If a correction formula is applied, and if the 
pupils have followed instructions, the corrected scores will correlate per- 
fectly with the uncorrected (rights) scores. It is true, of course, that 
there will still be some differences among the pupils in the extent to 
which they will follow instructions, but it is probable that, on the whole, 
there will be more uniformity in obedience to a positive instruction of this 
type than to a negative instruction not to guess. 
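The perfect correlation claimed above follows from simple algebra, as the sketch below illustrates. The test length, number of choices, and scores are hypothetical, not taken from the text:

```python
# If every examinee attempts all N items, then W = N - R, and the
# corrected score S = R - (N - R)/(n - 1) is a linear function of R alone.
# A linear transformation cannot change relative standing, so corrected
# and uncorrected (rights) scores correlate perfectly.

N, n = 100, 5                      # hypothetical test length and choices per item
rights = [40, 55, 62, 70, 88]      # hypothetical number-right scores

corrected = [r - (N - r) / (n - 1) for r in rights]

# Each corrected score reduces to r*n/(n-1) - N/(n-1):
for r, s in zip(rights, corrected):
    assert abs(s - (r * n / (n - 1) - N / (n - 1))) < 1e-9

print(corrected)   # [25.0, 43.75, 52.5, 62.5, 85.0]
```

The linearity disappears, of course, as soon as examinees omit different numbers of items, which is why uniform instructions matter.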

However, there are two serious objections to this procedure of directing 
all examinees to mark all items. One is that it increases the error variance 
in the scores. If the examinee guesses on some items, his score is in part 
dependent upon his luck in those guesses, which would, of course, vary 
considerably from one examinee to another. If there is no guessing, this 
source of error is eliminated. 

Another and perhaps more serious objection is concerned with the 
psychological effect of instruction to guess upon pupils and teachers. Regard- 
less of the opinion of the psychologists and test specialists on this question, 
many school administrators and teachers object strenuously to the idea of 
directing pupils to attempt every item in a test. They say that among their 
students there is already too much carelessness, loose thinking, and guessing 
and that the educational implications of tests in which this tendency is 


aided and abetted are decidedly bad. Some teachers feel that instead of 
encouraging guessing on objective tests, schools should take vigorous steps 
to discourage it. For example, Count Etoxinod (24) (pseudonym), after 
deploring the lack of respect for accuracy on the part of American students, 
proposed that the penalty for wrong answers on true-false tests be doubled 
in order to discourage guessing and to promote habits of careful thinking. 
Because of the apparent relationship of tests to educational practice, it 
is not probable that test specialists will ever be able to convince school 
people generally of the desirability of instructing pupils to guess when 
they take objective tests. 

Under the circumstances, it seems probable that the most desirable 
statement concerning guessing that a test maker can include in his manual 
is a compromise between the guess-and-do-not-guess instructions. This 
kind of direction will carry more weight if the examinees know that the 
scores are to be corrected for guessing. For instance, the Educational Test- 
ing Service uses the following sentence on the cover page of the Cooperative 
Achievement Tests: "You may answer questions even when you are not 
perfectly sure your answers are correct, but you should avoid wild guessing, 
since wrong answers will result in a subtraction from the number of your 
correct answers." 

One of the most practical reasons for correction for guessing is found 
in the fact that regardless of what instruction test makers and examiners 
give to the pupils, there may be systematic differences in test performance 
from school to school and from class to class because of differences in 
pretesting instructions given by individual teachers. Some teachers in 
the interest of promoting habits of careful thinking and accuracy will warn 
their pupils against guessing; others anxious to have their own classes 
make a good showing will urge their pupils to attempt all items; still 
others will explain to the pupils how, if they are able to eliminate certain 
choices in a multiple-response item and reduce the chance to a choice 
between two or three items instead of among five, they will gain in the 
long run by intelligent guessing. Such uncontrollable variations in the 
informal instructions and discussions which take place in individual 
classrooms tend to confound the carefully laid plans of examiners and 
test authors. Correction formulas will discourage such practices and, to 
some extent, will compensate for the results of such practices, although, 
of course, it is too much to hope that they will entirely offset them. 
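The arithmetic behind the "intelligent guessing" advice described above can be sketched as follows. This is a hypothetical illustration, not part of the original text: under the formula S = R - W/(n - 1), a guess among the k choices that survive elimination has a positive expected value whenever even one choice can be ruled out.

```python
# Expected score gain from guessing among k remaining choices of an
# n-choice item scored R - W/(n - 1): a right guess (probability 1/k)
# gains one point; a wrong guess (probability (k-1)/k) loses 1/(n-1).

def expected_gain(k, n):
    """Expected gain from guessing among k surviving choices of an n-choice item."""
    return 1 / k - ((k - 1) / k) * (1 / (n - 1))

print(expected_gain(5, 5))   # 0.0   -- blind guess among all five: no expected gain
print(expected_gain(2, 5))   # 0.375 -- two choices left: guessing pays in the long run
```

This is exactly the point the teachers in the passage make to their pupils, and one reason the correction formula cannot fully offset classroom coaching.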

Directions to the examiner and to the subjects 

As Feder (23) has pointed out, the problem of optimum directions 
needs objective investigation. He states that "builders of standard tests 


should recognize the importance of adequate, but not cumbersome, direc- 
tions, and of determining by experimental procedures the best directions 
before marketing their products." In the case of a test printed over the 
years, it is not too much to expect that the directions in successive editions 
will improve. A professional national jury of experts to criticize each test, 
before publication, in this respect, as well as in other ways, would be a 
very useful innovation. 

A study by Weidemann and Newens (69) indicated that the nature 
of the directions may have considerable effect upon test scores. They tried 
out five sets of directions for giving tests involving true-false and in- 
determinate statements and found significant differences in the resulting scores. 

Criteria for preparation of directions. — The following criteria are among 
those to which special attention should be given when a test maker is 
writing the directions for the administration of his test: 

1. Assume that the examiner and examinees know nothing at all about 
objective tests. Although most teachers and pupils have had experience 
with objective tests, there are still some individuals who have not used 
them. Unless the whole procedure of administering and taking the tests 
is carefully explained, injustice may be done to some individuals, or even 
to groups of individuals, because of lack of understanding of what is to 
be done. The test author should take nothing for granted. 

2. In writing the directions, use a clear, succinct style. Although no 
steps in the administration of the test should be omitted, long and involved 
instructions should be avoided. Feder (25, p. 30) found that clear-cut, 
succinct directions were superior to longer, more detailed directions, for 
the briefer directions tended to avoid the danger of inducing an incorrect 
mental-set, saved time, and were less fatiguing. 

3. Make the more important directions stand out through the use of 
different sizes and styles of type. A common procedure is to use boldface 
type for everything which is to be read to the examinees and italics for 
the time limits. Modern test manuals are, in general, much better arranged 
in this regard than were those of earlier tests. The mechanical arrangement 
and typography can add to, or detract from, the proper test-taking attitude. 

4. Give the examiner and each proctor full instructions concerning what 
to do before and after the test is given as well as during its administration. 
The manual should make suggestions about advance arrangements, dis- 
tribution of test materials to examination rooms, instructions to proctors, 
collection of materials at the close of the examination, and return of 
blanks to prevent their falling into unauthorized hands. 

5. Check on all possible misunderstandings and inconsistencies by having 


several examiners try the directions out experimentally and report their 
procedure in detail together with suggestions for improvement. Occasion- 
ally, inconsistencies in the interpretation by examiners will occur in con- 
nection with instruction manuals carefully prepared under expert super- 
vision. For example, in the older forms of the American Council on 
Education Psychological Examination, which consisted of five parts, 
short instructions were printed in the manual and were to be read by the 
examiner on completion of each part. Time was intended to be taken 
out for the reading of these directions, but this fact was not specifically 
stated in connection with the directions. It was found on the basis of 
replies of schools in a survey made by the Educational Records Bureau 
that in about one-third of the schools time for the reading of these short 
directions was not taken out in either some or all groups taking the test. 
The total time involved was not sufficient to affect the scores of the pupils 
very significantly, but the inconsistency does lend force to the point that 
lack of uniformity in procedure is likely to occur in the administration 
of even a very well-known test unless every aspect of the directions is 
clarified for all examinees. The work-limit test is less susceptible to such 
errors than the time-limit test. 

As a result of their observation of pupils taking tests, examiners or 
proctors can sometimes offer very helpful suggestions concerning details 
which a test maker could not foresee when he wrote the directions. For 
instance, in connection with the administration of the rate-comprehension 
part of the Iowa Silent Reading Test, this direction appears in the manual 
of instructions: "At the end of one (1) minute say: 'Stop! Put a circle 
around the word you read last, and then continue to read until time is 
called. You will have two more minutes in which to read as much of 
this story as you can. Remember, you are to answer questions about it 
later." One of the users of the Iowa test made the following suggestion 
concerning this direction: "The student is told to put a circle around the 
word he read last and to continue reading. Some will do this, while others 
will listen through the sentences the supervisor has still to read to them, 
thus destroying uniformity. If 'continue to read' were saved to the last, 
and if we were instructed to stop the watch during the reading of the 
directions, the difficulty could be reduced." A longer time limit than two 
minutes also would reduce the relative error involved. 

6. Keep the directions for the different forms of a test, or for the 
various booklets in a set of tests designed to be used in the same program, 
as nearly uniform as possible. Greater accuracy in the administration of 
tests can be expected if the examiners and proctors can follow one general 
procedure for all the tests in a series than would be true if they were 


required to use a different arrangement of directions and timing with each 
separate test. A good illustration of the use of one set of directions to 
cover a variety of tests in a series will be found in the Directions for 
Administering the Cooperative Tests, one page of which is reproduced 
in Figure 8. Where the answers are to be recorded in the test booklets, and 
not on separate answer sheets, the directions for administration of any of the 
Cooperative tests may be found on this one page. 

7. Where judged necessary or helpful, give practice tests before each 
regular test. Such practice tests help to bring about understanding of the 
test situation and often tend to relieve tension under the actual test ad- 
ministration. Although the practice tests usually will not be scored, if 
they are easy and are scored, a low score might be used as an index to 
throw out a test performance as "not sufficiently understood to be allowed 
to stand as the performance of this individual." 

Administration of Tests from the Viewpoint of the Examiner 

A well-prepared examiner should be aware of the multiplicity of factors 
entering into test performance, the kind of environmental conditions, and 
the special problems created by the current tendency to administer tests 
with separate answer sheets. He should also have at hand a list of pro- 
cedures to be followed in preparing for the test, during the test, and 
after the test has been given. 

Factors influencing test performance 

A great many variables influence the test performance of an individual 
pupil. Some of these variables can be controlled during the administration 
of the test, while in the usual test situations others are wholly beyond the 
examiner's control. 

Factors beyond the control of the examiner. — Among the variables which 
the examiner cannot control, or cannot easily control, are chronological 
age; brightness; number of years of schooling; rate of growth; school, 
departmental, and course objectives; content of courses; teaching methods; 
teaching ability; practice effect; and tendencies of different teachers and 
schools to subvert avowed objectives in favor of test objectives. 

In establishing the norms for a test, it probably is good practice to 
discard some test subjects as not being "representative subjects." Toops 
(59) has proposed an ideal of "replicability of a culture after a decade" 
as a criterion for eliminating some classes of examinees, or weighting others. 
Concretely, in 1944, norms of college entrants were decidedly "off-color." 
The college boys in particular were the boys under 18, not yet subject to 
the draft; therefore, presumably much brighter than in a more normal 


Standard Procedure for Administering Tests 
Not Divided into Parts 

1. When all are seated, the examiner should say: 

"We shall now pass out the test booklets. 
Do not open them now. As soon as you get 
the booklet, till in your name and the other 
items of information called for on the cover 
page. Print your name. When you have 
finished filling in the blanks, read carefully 
the directions on the cover page; then wait 
for further directions. Do not open the 
booklet until I tell you to do so." 

2. Allow sufficient time for filling in the spaces 
on the cover page and reading the directions. When 
each student has done this, the examiner may orally 
emphasize any points that need emphasis, and say: 

"Are there any questions? No questions 
may be asked after the examination 

3. Answer all legitimate questions, and then say: 

"When I say 'Begin,' turn to the first page 
of questions, read the directions at the top 

of the page, and start work. Work as fast 
as you can without making mistakes. Ask 
no questions. Read the directions again if 
you do not understand. You are not ex- 
pected to answer all the questions in the 
time limit. Begin." 

4. Note the exact time when you say "Begin" and 
write it down. Allow exactly the number of minutes 
specified for the test, counting from the moment you 
say "Begin." Do not allow extra time for rcadinp; 
the specific directions inside the booklet. At the 
end of the allotted time, say: 

"Stop! Even if you have not finished, close 
your booklets. See that you have clearly 
printed your name and that you have given 
all the other information asked for." 

5. Have the booklets collected at once. In doing 
so, make sure that all the information necessary for 
identification and classification has been entered. 
Supply any necessary missing items of information. 

Standard Procedure for Administering Tests 
Having Two or More Parts 

Including the Cooperative English Tests, Forms Q Through T 

1. When all are seated, the examiner should say: 

"We shall now pass out the test booklets. 
Do not open them now. As soon as you get 
the booklet, fill in your name and the other 
items of information called for on the cover 
page. Print your name. When you have 
finished filling in the blanks, read carefully 
the directions on the cover page; then wait 
for further directions. Do not open the 
booklet until I tell you to do so." 

2. Allow sufficient time for filling in the spaces on 
the cover page and reading the directions. When 
each student has done this, the examiner may orally 
emphasize any points that need emphasis, and say: 

"Are there any questions? No questions 
may be asked after the examination 

3. Answer all legitimate questions, and then say: 

"When I say 'Begin,' turn the page to Part I, 
read the directions carefully, and start 
work. Work as fast as you can without 
making mistakes. Ask no questions. Read 
the directions again if you do not under- 
stand. You are not expected to answer all 
the questions in any part in the time limit, 
but if you should finish before time is called, 
go on to the next part. If you finish the last 
part before time is called, you may go back 
and work on any earlier part. Begin." 

4. Note the exact time when you say "Begin" and 
write it down. Allow exactly the number of minutes 
specified for the part of the test which you are 
administering, counting from the moment you say 
"Begin." Do not allow extra time for reading the 
specific directions at the beginning of the part. 

At the end of the allotted time for Part I, say: 

"Stop! Even if you have not finished Part I, 
begin Part II. Read the directions for Part 
II carefully. If you finish Part II before the 
time is up, you may go back and work on 
Part I again, or you may go on to the next 

5. The examiner should see that all students begin 
Part II promptly. Allow exactly the specified num- 
ber of minutes, then say (if there is a Part III): 

"Stop! Even if you have not finished Part 
II, begin Part III. Read the directions for 
Part III carefully." 

6. Thus each part of the test is administered until 
all parts have been given. Then say: 

"Stop ! Even if you have not finished, close 
your booklets. See that you have clearly 
printed your name and that you have given 
all the other information asked for." 

7. Have the booklets collected at once. Make 
sure that all the information necessary for identifi- 
cation and classification has been entered. Supply 
any necessary missing items of information. 

Fig. 8. — Directions for administering the Cooperative Tests. Page 2 of an 8-page folder 
issued by the Cooperative Test Division of Educational Testing Service. 


year, say 1940. Norms in 1944 probably would have been more repre- 
sentative if they had been based on women only. Research in this realm 
is a problem for the future. 

Three other factors which may cause considerable variation in the per- 
formance of an individual pupil from time to time are physical fitness, 
emotional state or feeling tone, and motor performance. There seems to 
be no clear-cut evidence in regard to just how much relationship there 
is between physical well-being and test performance. It is possible that 
one's ability to answer test questions is not changed greatly by a headache 
or a cold, but interest in the test and willingness to put forth effort are 
frequently affected, and these in turn influence test scores. 

Instances are on record where test performance has been changed rather 
drastically by the emotional status of an individual pupil. Often the ex- 
aminer is not and cannot be expected to be aware of these subtle causes 
of unreliability in the results for certain individuals, but in cases where 
pupils deviate markedly from other data concerning them, the possibility 
of emotional upsets at the time the tests were taken should be investigated 
as a possible contributing cause. Such cases should be retested. 

As one would expect, there is evidence that motor performance is a 
factor in scores on a timed test designed to measure abilities other than 
motor skills. Prichard (4^) found that in a typical rate test a significant 
change in writing speed was accompanied by a significant change in score. 
Motor performance can probably be somewhat, but not wholly, controlled 
by means of the directions and conditions of the test. Other things equal, 
the smaller the reaction (check marks) and the less the writing, the better. 

Factors which can be controlled by the examiner. — In addition to the 
variables mentioned, the testing conditions themselves involve a large 
number of variables, and these are, to a considerable degree, under the 
control of the examiner. It is not known how much influence lighting, 
time of day, size of group, and so forth, have upon test scores, but these 
may be important, at least for certain individuals. 

The influence of motivation on test performance was mentioned in an 
earlier section of this chapter. There is unquestionably a difference among 
individuals in the attitude and effort they exhibit in a testing situation. 
There also seems to be a difference in the motivation shown by the same 
individual at different times. This unevenness of motivation, if it exists 
in marked degree, can significantly reduce the validity of test results. 
Without doubt, the difference in motivation from group to group and 
from time to time for the same individual is due partly to the fact that 
in a school-wide testing program where a large number of teachers are 


used as examiners and proctors, some teachers take an interest in the 
assignment and do it well, whereas others are bored by and indifferent to 
the entire testing program, and by their example they destroy the morale 
of the pupils to whom they give the tests. An interested, businesslike, 
efficient examiner who is punctual and firm without being too severe can 
do much to bring about optimum motivation among his examinees.

A question related to motivation concerns whether or not pupils should 
be warned of a coming test. It seems to be the practice in most schools 
to inform pupils in advance that they are to take a test, and there is some 
evidence that this practice is desirable. Tyler and Chalmers (67) found
that the average scores of junior high school pupils were increased slightly 
(approximately 1-2 percent) by giving specific warning of the date of 
the examination two days before it was to be administered. 

The seating arrangement in the testing room should be carefully 
planned both to make the conditions as comfortable as possible for the 
pupils and to reduce opportunities for copying. Where the size of the 
room permits such an arrangement, the use of alternate seats, and even 
alternate rows, is preferable. Individuals otherwise honest are likely to 
resort to cheating or collusion if a feeling of insecurity over the content 
of the test is coupled with strong motivation. Fenton (26) found, for
example, that when given an opportunity to cheat in three experimental 
situations, 63 percent of a college class of girls did cheat in one or more 
situations. Bird (3) and Dickenson (19, 20) have suggested procedures
to be employed in studying the number of identical errors in the papers 
of pupils suspected of cheating. It may be necessary to resort occasionally 
to such procedures after tests have been given, but obviously the pre- 
ferable arrangement is to eliminate opportunities for cheating by careful 
control of the test conditions. Having the seats numbered and requiring 
the examinees to record their seat numbers may be a deterrent to cheating. 
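The identical-errors procedures of Bird and Dickenson are not reproduced here, but the basic tally they rest on — counting items that two papers miss with the same wrong answer — can be sketched as follows. The answer key, papers, and function name are illustrative assumptions, not taken from those sources.

```python
# Count the items that two suspected papers miss with the SAME wrong answer.
# A count that is high relative to each paper's total errors may justify a
# closer look; the key and papers below are hypothetical.

def shared_identical_errors(key, paper_a, paper_b):
    """Number of items both papers answer incorrectly AND identically."""
    shared = 0
    for correct, a, b in zip(key, paper_a, paper_b):
        if a != correct and b != correct and a == b:
            shared += 1
    return shared

key     = list("ABCDABCDAB")
paper_a = list("ABDDACCDBB")   # three errors
paper_b = list("ABDDABCDBB")   # two errors
print(shared_identical_errors(key, paper_a, paper_b))  # -> 2
```

Such a count is only a screening device; the likelihood of chance overlap must be judged against the difficulty of the items and the popularity of each wrong answer.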

Environmental conditions of the test 

In planning the physical conditions of a testing program, the head 
examiner in each school, of course, has to work within the framework of 
a particular environment. In some school buildings it is possible to set 
up almost ideal testing conditions, whereas in others the conditions will 
be imperfect and will vary from group to group even with the best of 
planning. It is advisable to give special attention to obtaining the best 
possible space, lack of crowding in the seating arrangement, adequate 
light, comfortable temperature and ventilation, and freedom from distractions.

In the average school, the rooms available for testing will probably 
include the auditorium, the library-study hall, the school cafeteria, the 
science laboratory, and classrooms constructed and equipped to handle 
classes of twenty-five to sixty pupils. One procedure for handling the 
testing program is to follow the usual school schedule, having the tests 
administered in the classrooms by the regular teachers. This plan has the 
advantage of confining the testing to comparatively small groups where 
it is relatively easy for the examiner to make the directions clear to all 
pupils and to keep everyone under his personal observation. It is also 
probable that administration of tests in the regular school routine by
teachers whom the pupils know is less disturbing to nervous pupils than
the carrying-out of a special testing schedule.

On the other hand, the use of the regular schedule has several dis- 
advantages. In the first place, since the personality, attitude, and pro- 
cedures of the different examiners constitute a variable in test administra- 
tion, the larger the number of teachers serving as examiners, the greater 
the opportunity for the variable to be a cause of differences in test scores. 
In the second place, some good teachers are constitutionally poor examiners, 
and one can be sure in advance that they will not do a good job, but they 
cannot be readily eliminated if the regular schedule is followed. Thirdly, 
overcrowding and opportunities for copying are likely to occur in the 
testing of large classes even when pupils are required to move their chairs 
as far apart as possible. 

A fourth disadvantage of following the customary schedule is that dif- 
ferent class groups taking such a test as English are likely to be meeting 
throughout the day. The pupils in classes tested near the end of the day 
have an opportunity to obtain a certain amount of information and help 
in connection with the test from pupils who have had it at an earlier 
hour. Moreover, some teachers who are testing their own classes may be 
so eager for them to do well that they will yield to temptation to offer 
a few indirect suggestions which will help the pupils obtain higher scores. 

Everything considered, the planning of a special schedule for a testing 
program usually is advisable, although in the primary grades, testing 
with regular classes may be preferable. Under a special schedule, the 
services of the teachers who are potentially the best examiners can be 
utilized to the fullest extent, a given test may be scheduled at the same 
hour for all groups, and the best of the school's equipment can be utilized 
for the testing. A school "holiday" for testing is probably more justifiable 
than most of the "days" for which holidays are called. 

Research is needed on the optimum size of groups to be tested. In the 
absence of objective evidence, it is not possible to say just how large 
such groups should be, and, in any event, considerable leeway would be 
necessary, for the most desirable group size no doubt depends partly upon
the age of the pupils, the ability and experience of examiners in handling 
large groups, the acoustical properties of the testing room, and the avail- 
ability of staff members to serve as proctors. If other things are equal, 
it usually is better to plan for a few rather large groups than for many
small ones. When everything is well planned and the room is thoroughly 
proctored, groups of at least 300 can be tested at one time very successfully 
by an experienced examiner. 

As a rule, the best single room for testing in a school is the library-study 
hall. This room is likely to be equipped with good desk space and to be 
well lighted. Some school cafeterias, with their arrangement of tables and 
chairs, offer very adequate writing space for testing. It is well to keep 
in mind, however, that often the acoustics in such rooms are not the best. 
School auditoriums usually are not very satisfactory for test administration 
because of lack of sufficient desk space. However, those in which the 
seats are equipped with desk arms which may be raised into place for 
writing are sometimes suitable for the administration of tests contained 
in small booklets where all the writing is done in the booklets themselves. 

Classrooms equipped with desks, particularly the larger rooms, may 
also be utilized in the testing program. Rooms with desk-arm chairs should 
be eliminated from the schedule if possible, particularly where the tests 
have separate answer sheets. 

Reference may be made once more to a point mentioned in an earlier 
section — that it is desirable for test makers to try to standardize the con- 
ditions under which their tests are administered. Little has been done in 
this direction thus far, and it is true, of course, that as long as many 
school buildings are inadequate for testing purposes general progress in 
this direction will be slow. Nevertheless, examiners can themselves do 
much to improve the reliability of test results by standardizing as far as 
possible the local environmental conditions under which tests in their 
own schools are administered. 

Administration of tests for machine scoring 

Special attention should be given to the equipment of rooms in which 
tests with separate answer sheets are to be administered, particularly those 
which are to be machine-scored. Since each examinee must handle an 
answer sheet as well as a booklet, more desk space is desirable. However, 
in some large institutions where conditions make it impossible to use large 
desks in the administration of tests, fairly satisfactory results have been 
obtained through administration of the tests to all students in a large 
auditorium under conditions where each student holds a board on his 
lap. This procedure tends to bring about comparability within the group 
even though the results may be somewhat too low as compared with the 
norms. The surface of the desk, or other writing space, should be smooth 
and hard in order that the subject may make heavy, black marks without 
punching through the answer sheet with his pencil and without too much effort.

When tests are administered with separate answer sheets, occasionally 
an individual will lose his place and mark a series of answers one place 
too high or too low on the sheet. This mechanical error, if undiscovered, 
may lower his score significantly. To forestall this type of error, Taylor 
(JJ) has recommended a simple device which consists in supplying each 
examinee with a blank sheet of paper, 8½ by 11 inches, to be used
both as scratch paper and as a means of marking his place. He suggests 
further that one side should be printed with examination instructions so 
that the examinee will use only one side for notes and thus probably will 
not transfer graphite from the guide sheet to the answer sheet. 

There is evidence that the type of desk influences to some extent the 
scores on tests given with separate answer sheets. Traxler and Hilkert (66) 
compared the mean scores made on the machine-scoring form of the 
American Council Psychological Examination by seven pairs of groups 
of secondary school pupils. Five pairs of groups were selected at random, 
and two pairs were matched on the basis of Otis IQ. One group of each 
pair took the test at desks, and the other group took it in chairs with 
desk arms. All the differences in mean scores were in favor of the desk 
group, although only one was as much as four times its probable error. 
Kelley (34) applied additional statistical techniques to these data and 
showed that the difference was clearly significant at the ninth-grade level 
and that probably the desk group continued to have an advantage in 
the upper years of high school. These results suggest that if test adminis- 
trators wish to make sure that pupils have the advantage of optimum 
conditions when taking tests with separate answer sheets, they should give 
these tests in rooms having desks instead of chairs with desk arms. 
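The criterion of "four times its probable error" can be restated in modern standard-error terms: for a normal distribution the probable error is 0.6745 times the standard error, so a difference of 4 PE is about 2.70 SE, somewhat beyond the conventional two-standard-error level. A minimal check of the conversion (the constant is the standard normal relationship; the helper name is ours):

```python
# Probable error (PE) is 0.6745 times the standard error (SE) for a normal
# distribution, so k probable errors equal 0.6745 * k standard errors.
PE_PER_SE = 0.6745

def pe_to_se_units(k_pe):
    """Express a difference of k_pe probable errors in standard-error units."""
    return PE_PER_SE * k_pe

print(round(pe_to_se_units(4), 2))  # -> 2.7
```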

Perhaps the greatest single problem in machine-scoring tests is that 
a large proportion of the papers often have to be re-marked before they 
will score properly in the machine. Notwithstanding the reading of special 
instructions and the use of a short practice test to illustrate the method, 
many pupils do not understand the necessity of making heavy, black marks 
on their answer sheets with the special pencil and the need for going over 
each mark more than once. In the case of some groups of papers received 
for scoring at the Educational Records Bureau, the staff found it necessary 
to re-mark almost every paper before beginning to score them. This extra 
work tends to create a bottleneck in a testing program with the net result 
that, despite the potential rapidity of machine scoring, reports of the 
results may be returned to the teachers little, if any, faster than would 
be the case if hand scoring were used. 

More important than the speed of scoring is the possible influence of 
wide variations in the excellence of the marking of machine-scorable 
answer sheets on the validity of the test data for instructional and guidance 
purposes. There is definite experimental evidence that the mean score of 
a group of answer sheets marked lightly or marked in a slovenly manner, 
with many stray dots on the papers, differs significantly from the mean 
score on a group of well-marked answer sheets (61). On an occasional
extremely poorly marked answer sheet, the score obtained by machine 
may be at least 25 percent lower than it should be. 

Under these conditions the scoring department is faced with a dilemma. 
If the answer sheets are scored without re-marking, the pupils with poorly 
marked sheets will be penalized. This type of penalty may be an effective 
disciplinary technique, but is an unsound testing technique. On the other 
hand, if the sheets are scanned and marked so that they will score properly, 
the conscientious pupils who follow instructions may indirectly be penalized 
on time-limit tests, for it takes longer to make a heavy, black mark and 
to go over it several times than it does to make a single light mark. 

The problems of machine scoring are not those of scoring alone; they 
are problems of administration as well. Examiners must cooperate fully 
with scoring departments in order to make machine scoring accurate and rapid.

One procedure for improving the marking of answer sheets is to provide 
each room examiner with a sample well-marked paper for exhibition. It 
is not sufficient merely to duplicate a marked answer sheet by the multilith 
or photo-offset process. A duplicated answer sheet loses some of the es- 
sential qualities of a sheet that will score correctly when inserted in the 
machine. Each sample answer sheet should be made up by hand with 
marks that are heavy, black, and glossy. The room examiner should pass 
the sample sheet around before the test is begun and should caution the 
pupils that if they do not mark their own answer sheets equally well, they 
will run the risk of receiving scores significantly lower than their true ones. 

Proctors should be instructed to move about the room during the 
examination and to caution individually any pupils they observe who are 
not marking their answer sheets heavily enough or who are carelessly leav- 
ing stray dots and marks on the papers. It is only through constant vigilance 
on the part of examiners and proctors that the influence of this variable 
on the results of machine-scored tests can be eliminated. The number of 
stray marks on the papers may be reduced if each examiner will suggest 
to the pupils that they rest their pencils on the question numbers as they 
work through the test. 

Procedures to be followed by the examiner^

One member of the staff of a school should have full responsibility 
for the administration of the testing program. He should carefully plan 
and carry out all the details of the entire program each year. The follow- 
ing rules for administration of a testing program are based on experience 
and have been applied successfully. 

Preparatory. 1. Select the tests carefully, preferably in cooperation with 
a faculty committee. If the school is located in an area where there is a 
state testing program, consider carefully the tests recommended for that 
program, since they are usually selected by experts in measurement and 
guidance. Take into account the tests recommended by testing organizations 
of national scope, particularly when the recommended tests are chosen by 
committees of administrators and teachers representing different types of schools.

2. Order the tests well in advance of the date on which they are to be
used. Allow ample time to get all materials in readiness before the date 
on which the tests are to start. Check quantity of tests immediately upon 
receipt, and if more are needed reorder at once. 

3. Plan in detail for the administration of the tests. Choose examiners 
and proctors with great care. If possible, use examiners who have had previ- 
ous experience giving the objective type of test. If inexperienced examiners 
must be used, they should be carefully rehearsed beforehand. Remember 
that some very intelligent people are temperamentally unsuited to the exacting 
routine of administering a test. You may use such persons as proctors for 
tests being given to larger groups, but they should not be placed in charge 
of the administration of a test. 

4. Duplicate an examination schedule, and see that every person concerned 
receives a copy. The schedule should give the time and place of each test, 
indicate just where each class which is to take the test is to go, where the 
pupils who are not taking the test should be during that time, what material 
the pupils will need when taking the test, and the name of the faculty 
member in charge of each examination. 

5. Avoid overemphasis on the tests. Urge the teachers to have the pupils 
take them "in stride." 

6. Give pupils who have never taken objective tests an opportunity to 
examine obsolete editions of tests of the same kind. Better still, have them 
take a short practice test of the objective type. 

^ Part of this section is quoted from Traxler (6^, pp. 137-59).


7. Do not distribute the tests to the examiners before the day of the 
examination. Have packages containing the requisite number of test booklets
and all accompanying materials made up and ready for the examiners when
the day for the tests arrives. Do this sufficiently in advance that any missing
items can be supplied. 

8. Provide each examiner with a manual and a sample copy of the test 
several days before the examination and urge him to study the manual and 
to practice by taking the test himself. Most errors in the administration of
tests are caused by failure of the examiners to prepare sufficiently beforehand. 

9. Provide each examiner and proctor with a written set of instructions 
outlining his duties at all stages of the examination. 

During the test. 1. Make arrangements so that there will be no interrup- 
tions or distractions during the testing period. Persons should not come 
into, or go out of, the room unless absolutely necessary. This is doubly 
important with timed tests. 

2. Seat the pupils in alternate chairs if possible. 

3. See that each proctor understands what is expected of him before, 
during, and at the end of the examination. The examiner should circulate 
among his proctors and keep them alert to their duties. 

4. Make announcements slowly and clearly in a voice that is loud enough 
to be heard throughout the room. Assume a businesslike and efficient 
attitude that will command attention, but do not be unnecessarily severe. 
Remember, some pupils become nervous when faced with an examination. 

5. Have proctors supply all pupils with booklets and pencils and with 
answer sheets, if the tests are to be administered with separate answer 
sheets. Announce that the pupils are not to write in the booklets nor to 
open them until so instructed. 

6. Have the blanks on the front of the booklets, or answer sheets, filled 
out. Be sure to announce the date, specify how names are to be written, 
and explain other items that may need clarification. Spend sufficient time 
on this step to see that the information is given correctly by the pupils. 
Ages and birth dates are especially important on tests of academic aptitude, 
for these determine what norms are to be employed. 

7. Hold faithfully to the exact wording of the printed directions unless 
there is an excellent reason for introducing a minor variation in them. The 
preparation of directions for a test is one aspect of test construction and 
standardization. The wording of the directions has been carefully thought 
out by the test author. Do not improvise or introduce short cuts. If you do, 
you may change the test results significantly. 

8. Time the examination with extreme care, using an interval timer or a 
watch which has a second hand and which has been checked for accuracy. 
It is advisable to have one of the proctors check your timing to be sure that 
no error occurs. In many tests, accurate timing is the most important single 
feature of the entire procedure of administration. The proctor should warn 
the examiner if he gives signs of neglecting his duty, but obviously the 
examiner must not depend on this warning for his signals. 


Timing technique, if ordinary watch is used

1) Synchronize second hand so that it hits 60 when minute hand is
exactly on a mark.

2) On saying "Go!," look at second hand and record: :37

3) Then look at minute hand and record: 56:37

4) Then look at hour hand and record: 10:56:37

5) Suppose time limit is 7½ minutes:

10:56:37 plus 7:30
equals 10:64:07
equals 11:04:07

6) Glance at watch every minute or two until 11:02 or 11:03. From
11:02 or 11:03, glance at it every 10 or 15 seconds until 11:03:30.
Then look at it continuously until 11:04:07, when time is called.

9. Move about the room occasionally to see that all pupils are working 
on the right part of the examination, but do not stand gazing over a pupil's 
shoulder until he becomes self-conscious, and do not constantly move 
nervously from pupil to pupil. 

10. Do not use the test situation to inculcate good disciplinary habits.
The single object of discipline in the test room is to keep everyone working 
at his maximum all the time, with a minimum of disturbance from all 
sources, including the examiner and proctors. Use gestures, facial expres- 
sions, soundless whispers, and so forth, in dealing with examinees during 
working period. Make it clear that no questions will be answered during 
working periods. If hand is raised, smile and shake head. If anyone speaks 
aloud or makes semi-audible signs of frustration, smile and put finger to 
lips; if this persists, frown; if a serious disturbance seems imminent, 
remove disturber from test room quickly and quietly, and make an appoint- 
ment to clear up the trouble later. Any disciplinary measure which disturbs 
the group is just as bad as any similar disturbance by an examinee. 

11. Stop the examination immediately when the time is up and collect 
the booklets. 
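The clock arithmetic in step 8 above — a start of 10:56:37 plus a 7½-minute limit giving a stopping time of 11:04:07 — is exactly what an interval timer automates. A sketch of the same computation (the date supplied is arbitrary filler):

```python
# Stopping time for a timed test: starting time plus the time limit.
from datetime import datetime, timedelta

start = datetime(2024, 1, 1, 10, 56, 37)  # the recorded starting time
limit = timedelta(minutes=7, seconds=30)  # a 7 1/2-minute time limit
stop = start + limit
print(stop.strftime("%H:%M:%S"))  # -> 11:04:07
```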

After the test has been given. 1. As soon as a certain test has been given, 
have all proctors turn in their booklets promptly. Alphabetize and check the 
papers against the class list. 

2. Except in cases of protracted illness, see that all examinees make up 
the examination. This is a bothersome step, but one that is unavoidable, for 
complete data are essential if the results are to be used successfully in either 
teaching or guidance. 

3. See that the tests are scored promptly. Report the results to the faculty 
in a form that they can use and provide them with an explanation of the
meaning of the scores.

4. Have the scores of each pupil entered on an individual cumulative 
record card and make this card available to both counselors and classroom 
teachers. The card may also be shown to parents if the data are carefully 
explained in conference. Mature pupils, especially those from the junior 
year of high school upward, may likewise be shown the results of their 
tests during interviews. 
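Checking the collected papers against the class list, as in step 1 above, amounts to two set differences: pupils with no paper (who need a make-up test) and papers with no matching name (usually clerical errors). A sketch with hypothetical names:

```python
# Compare the collected papers with the class roster.
class_list = {"Adams", "Baker", "Clark", "Davis"}  # hypothetical roster
papers     = {"Adams", "Clark", "Davies"}          # names on turned-in papers

missing_papers  = sorted(class_list - papers)  # schedule make-up tests
unmatched_names = sorted(papers - class_list)  # probably misspelled names

print(missing_papers)    # -> ['Baker', 'Davis']
print(unmatched_names)   # -> ['Davies']
```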

Scoring of Objective Tests 

Among the factors to be taken into account in the scoring of objective 
tests are the scoring formula to be employed, the weighting of items and
parts of the test, the kinds of provision for responses, and the types of 
keys to be used. Other important factors are the question of whether the 
responses are recorded in the test booklets or on separate answer sheets, the 
question of whether the tests are to be scored locally in a comparatively small- 
scale scoring organization or centrally in a large-scale program, the question 
of whether the answer sheets are to be hand-scored or machine-scored, the 
organization of the scoring unit, and the use of special machine equipment. 
The responsibility for some of these matters rests with the test maker, while 
for others it is centered in the scoring department. Decisions of test authors 
in regard to procedures are subject to change by the supervisors of the scor- 
ing, if better procedures can be devised provided that they do not affect the 
scores obtained. 

Correction for Guessing 

One of the first decisions which must be made about the scoring is 
concerned with whether the score is to be the number right or whether 
a formula to correct for guessing and chance factors is to be employed. 
The generalized formula for correcting for guessing is 


S = R - W/(n - K),

where S = score,

R = the number of right responses,

W = the number of wrong responses,

n = the number of suggested responses for a single item,

K = the number of responses to be selected or marked for each item.

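Taking the generalized correction as S = R - W/(n - K), the computation is a one-liner; for ordinary multiple-choice items K = 1, giving the familiar S = R - W/(n - 1). A sketch (the function name and sample figures are illustrative):

```python
# Correction for guessing: S = R - W / (n - K), where R and W are the numbers
# of right and wrong responses, n the options per item, and K the responses
# to be marked per item (K = 1 for ordinary multiple choice). Omitted items
# are not penalized.
def corrected_score(right, wrong, n_options, k_marked=1):
    return right - wrong / (n_options - k_marked)

# A 5-option test of 50 items: 30 right, 10 wrong, 10 omitted.
print(corrected_score(30, 10, 5))  # -> 27.5
```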
Scoring by this formula involves the assumption that every wrong 
response is the result of a guess, that all wrong responses are equally 
attractive or equally likely to be selected, and that therefore the law of